key: cord-0711764-nwhelro7
authors: Hong, Ye; Flinkman, Dani; Suomi, Tomi; Pietilä, Sami; James, Peter; Coffey, Eleanor; Elo, Laura L
title: PhosPiR: an automated phosphoproteomic pipeline in R
date: 2021-12-08
journal: Brief Bioinform
DOI: 10.1093/bib/bbab510
sha: 1896b604f512f9a0799d007285197c56b9924184
doc_id: 711764
cord_uid: nwhelro7

Large-scale phosphoproteome profiling using mass spectrometry (MS) provides functional insight that is crucial for disease biology and drug discovery. However, extracting biological understanding from these data is an arduous task requiring multiple analysis platforms that are not adapted for automated high-dimensional data analysis. Here, we introduce an integrated pipeline that combines several R packages to extract high-level biological understanding from large-scale phosphoproteomic data by seamless integration with existing databases and knowledge resources. In a single run, PhosPiR provides data clean-up, fast data overview, multiple statistical testing, differential expression analysis, phosphosite annotation and translation across species, multilevel enrichment analyses, proteome-wide kinase activity and substrate mapping and network hub analysis. Data output includes graphical formats such as heatmap, box-, volcano- and circos-plots. This resource is designed to assist proteome-wide data mining of pathophysiological mechanism without a need for programming knowledge.

Protein phosphorylation is a reversible post-translational modification (PTM) catalyzed by protein kinases of the transferase class [1] . Since its discovery in 1932, numerous studies have highlighted the importance of phosphorylation as a central regulatory process in cells [2] . High numbers of often tightly interconnected phosphoproteins participate in cell signaling and all aspects of cellular function from proliferation and differentiation to metabolism and neurotransmission, to name a few [3] [4] [5] . Phosphorylation is the most abundant 'signaling' PTM exceeding ubiquitination, methylation and acetylation [6] . To understand how complex phosphorylation changes, especially shifts introduced by pathophysiological states coordinate function, systems-level phosphoproteomics study becomes necessary [6] . Advanced mass spectrometry methods enable high-throughput measurement of phosphoproteomes [7] , however traditional downstream analysis does little beyond phosphopeptide identification and quantification. Recent developments in R packages have taken advantage of protein phosphorylation databases and annotation advances, thereby supporting the creation of an analysis tool that can better exploit phosphopeptide data.

Here we introduce PhosPiR, a pipeline which takes advantage of available open-source tools for a complete downstream analysis of mass spectrometry-derived data after phosphopeptide identification. No programing knowledge is required to run the pipeline. Our workflow consists of peptide quality control, data overviewing with histogram, boxplot, heatmap, and principal component analysis (PCA) plots, data annotation utilizing UniProt and Ensembl database, differential expression analysis including four statistical test options and post hoc testing, phosphosite translation across species, four enrichment analyses for phosphoproteins, post-translational modification set enrichment analysis (PTM-SEA) for phosphopeptide, kinase analysis, network analysis and hub analysis ( Figure 1 ). The pipeline is accompanied by video tutorials and we exemplify the functionality of the tool using previously published large-scale phosphoproteomic study of circadian clock changes in synaptoneuroscomes [8] .

The current pipeline marks version 1.1.

PhosPiR accepts output files directly from MaxQuant or preprocessed files from similar MS data processing platforms that provide peptide sequence identification and intensity data on PTMs [9] . The input file from MaxQuant for PhosPiR is the 'Phospho (STY)Sites.txt' file from the 'combined/txt' folder within MaxQuant. In case another spectra analysis tool is preferred instead of MaxQuant, such as Progenesis, Spectronaut, openMS or PEAKS, the user should select 'Other' for data input. The input format for this option is explained in Supplementary Table 1A, and demonstrated with an  example input file in Supplementary Table 1B. The steps for user data input are also outlined in the user support videos found at https://youtu.be/c7n7yE0DMsA (short setup introduction video), and https://youtu.be/n4EagNo xusI (long pipeline run demonstration and result file introduction video). To start the workflow, simply run the 'run.R' script in R program with version 4.0.3 or above. R can be downloaded from https://www.r-proje ct.org/

If the input file is 'Phospho (STY)Sites.txt' directly from MaxQuant, the pipeline will automatically remove MaxQuant-marked reverse sequences and potential contaminants and sum the intensities for each phosphosite entry. However, if the user wishes to keep track of possibly falsely labeled potential contaminant sequences, or have the analysis done on individual phosphosite intensities separated by sequence phosphocount number for each site (similar to the 'expand site' option in Perseus), a 'Special Input Generation' step can be selected, where details can be found in Supplementary Table 1C . This is specific for MaxQuant output and is not included as part of the analysis in the 'All steps' choice of the analysis run option.

Filtering and normalization steps are implemented as data preprocessing. It is important to understand the data and make educated choices here to bring forth the most reliable result. Rows and/or columns with excessive missing values can be filtered out here. The user can also choose not to filter. For normalization, there are three choices, normalize and impute, only normalize, or neither. proBatch package [10] and MSImpute package [11] are utilized for normalization and imputation, respectively. Median normalization is performed when 'only normalize' is chosen. Quantile normalization and low-rank approximation imputation is performed when 'normalize and impute' is chosen. These normalization methods were chosen as they have proven successful with phosphoproteomics [8, 10, 12] . Imputation requires at least four nonmissing values per row, if the input data do not satisfy this requirement, the user will be forced to choose one of the other two options.

Organism information, sample group information and comparison information needed to be setup by the user. PhosPiR provides a prompt window accompanied by a guide for the setup process. For organism selection, the user should highlight an organism from the organism list, or if the organism is not available from the list, select 'Other' from the list, then input the scientific name of the source organism. A few analysis steps are only available for human, mouse or rat data.

For sample group information, the user could set it up either by inputting an information file or by setting up groups within the pipeline. For the first option, a template of the information file is included in Supplementary Table 2 . Both .csv and .xlsx formats are accepted by PhosPiR. For the second option, the user will be asked how many group classifications are found in the data, for each classification, a brief description is entered, e.g. treatment or genotype, followed by group order. Group order is recorded from the user's selection of sample names in a list of all sample names within the dataset for each group. After the user has selected samples for group 1, the pipeline asks if the current group classification contains another group, upon pressing 'yes', the list of all sample names will be displayed again, this time with all previously selected samples marked with their respective group numbers. The selection process repeats until all groups from the current group classification are setup. Multiple group comparisons can also be setup by providing an information file (Supplementary Table 2 ) or going through the pipeline process. In case of setting up in PhosPiR, the user chooses the group classification first, and then enters, e.g. '1,2' to specify the groups that should be compared against each other. The process can be repeated for as many group comparisons are wished. Lastly, the user can select whether to run all analysis steps, or only special input generation/annotation/overview figure/differential analysis step. After setup, PhosPiR will run the analyses automatically, however, between analyses, step-specific choices will be given to the user, and the pipeline will not continue until the user has responded.

Several figures will be plotted automatically in order to provide an overview of the data distribution. Boxplot and histograms are suitable visuals for comparing sample distributions. Heatmaps, PCA with k-means clustering, and 3D PCA plots will automatically display the results from unsupervised clustering of the data, providing informative biological patterns from the data. Heatmap clustering utilizes Euclidean distance and completelinkage clustering. Two heatmaps are generated, one with only column (sample) clustering, and one with column and row clustering. K-means clustering on PCA plot sets the number of clusters to the reference group count number plus 1. Log 2 intensity values are automatically generated and used for overview figures. Missing values are assigned intensity of 1. A few packages are utilized for this step. Boxplot and histogram are plotted with 'ggplot2' [13] ; heatmap is plotted with 'pheatmap' [14] ; PCAs and 3D PCA are plotted with 'fingerprint' [15] , 'vegan' [16] , 'rgl' [17] , 'FactoMineR' [18] , 'factoextra' [19] , 'plot3D' [20] and 'magick' [21] .

A data annotation step utilizes information from all organisms found in the Ensembl database to identify for example reviewed accession and phosphorylation site position, Entrez ID, genome position of the protein, human ortholog genome position, accession, identify score and sequence alignment, protein pathology, expression, PTMs, subcellular location and links to publications containing information on the protein in question. For each unique protein i.d., this information is extracted from both the Ensembl and Uniprot database. For nonhuman organisms, the human ortholog information is also included for comparison. Due to the long run time, the user has the choice to opt out of UniProt and human ortholog information mining.

Nonhuman organism data usually have many unreviewed accessions within the dataset. Some databases such as PhosphoSitePlus, host site information based on Swiss-Prot accessions, and does not include unreviewed accessions. This results in difficulties matching the input phosphosite identity to the database's information. PhosPiR solves this issue by identifying the Swiss-Prot accession for the protein and aligning the sequences to generate the equivalent reviewed phosphosite position. This reviewed site information can be used for database searches by data annotation tools, thereby maximizing the identification of associated biological information. Human ortholog information allows for direct comparison of model organism data to human information. Pairwise alignment to human ortholog protein sequences allows the user to identify orthologous phosphorylation sites in human for any site of interest identified in their organism. Sequence alignment should only serve as a reference. Although it should be accurate for alignments with high identity scores, its practicality decreases as sequence identify score decreases. The following packages are utilized for this step: 'biomaRt' [22] , 'Biostrings' [23] , 'GenomicAlighments' [24] , 'protr' [25] and 'UniprotR' [26] .

Statistical tests are performed based on the group comparison setup. The user selects whether or not the data are paired in the given comparison and is offered a choice of statistical tests. For two group comparisons, fold change is automatically calculated. The fold change direction is determined by the group number, where the group with the larger number is the numerator. Group numbering should therefore take this into account. Four statistical tests T-test, Wilcoxon signed-rank test, reproducibility-optimized test statistic (ROTS) and rank product test can be chosen. T-test should not be chosen for nonparametric data. All tests can be selected if desired. Each test will yield a P-value and an FDR value for each data row. For a multiple group comparison, ANOVA with post hoc Tukey HSD Test will be performed if the groups are not paired, and linear mixed effect (LME) modeling is performed if groups are paired. Next, the user can set thresholds based on P-value or FDR for example. The significant lists will then be used as input for enrichment and network analyses. The user can choose volcano plots to visualize the statistical results. Multiple comparisons (maximum 4) can be plotted together. Based on the selected statistical cutoff, significant entries in the plot will be labeled if the number of significant changes is less than 60 in total. ROTS analysis utilizes the 'ROTS' package [27] , 'RankProd' [28] performs the rank product analysis, 'multcompView' [29] , 'lsmeans' [30] and 'nlme' [31] are ultilized for LME modeling. 'ggrepel' [32] and 'gridExtra' [33] are utilized in addition to 'ggplot2' for volcano plots.

Enrichment analysis is performed on both phosphoprotein intensity data and phosphosite data. Protein level enrichment utilizes the 'clusterProfiler' package [34] . This powerful analysis tool enables gene ontology (GO) enrichment, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment and cell marker enrichment, and for human data, disease-association enrichment. For KEGG analysis, it automatically utilizes the current latest online version of KEGG database which for example includes the COVID-19 pathway from the newest release (v.98). All the listed enrichment analyses are performed for each significant list created earlier.

Both universal background and dataset background are applied for separate analyses. Only enrichments with significant entries are recorded as a result. Phosphosite enrichment utilizes the 'PTM-SEA' (PTM-signature enrichment analysis) tool and its library PTMsigDB [35] . PTMsigDB curates detailed PTM information based on perturbation-induced site-specific changes, such as the direction of phosphorylation change upon a signaling event, and the affected signature sets that are collectively regulated. PTM-SEA analysis is performed on the entire dataset for all group comparisons. It is available for human, mouse and rat.

A kinase-substrate relationship analysis is performed with the 'KinSwingR' package [36] , on each group comparison. It predicts a kinase-substrate interactome based on a library of motifs, and then integrates the fold change direction and P-value from the statistical results, to calculate a normalized swing score. The score resembles a z-score which predicts kinases' activity changes. The Pvalue is calculated to determine the significance level based on a permutation test. PhosPiR utilizes the latest kinase information from the PhosphoSitePlus human, mouse and rat data for customized kinase library [37] instead of the kinase dataset included in KinSwingR, which has become outdated and is only available for human.

A kinase-substrate network Circos plot is automatically created with the 'circlize' package [38] which shows the top 250 significant substrates. Kinases are connected with edges to their specific substrate sites with phosphorylation sites grouped by phosphoprotein.

The phosphoprotein network is built using the 'STRINGdb' package [39] . The STRING tool uses its own proteinprotein interaction database which is used in PhosPiR to build an interaction network from each significant data list. Only interactions with greater than 0.4 confidence score (ranging from 0 to 1) are included in the STRING network figure (e.g. Supplementary Figure 1 ). The user can choose to identify hubs from within each network, and a hub interaction-enrichment score is also calculated. Hub phosphoproteins represent highly connected proteins within the network and are therefore likely to be functionally informative. The user can also choose from two cutoffs for defining hubs: the top 10% highest interactions, or an interaction count of 1 SD above the mean. The hub interaction enrichment is calculated by generating 1000 control networks for each hub and comparing hub interactions in control networks to the query network. Both P-value and FDR are presented in the result.

To demonstrate the functionality of PhosPiR, we analyzed synaptoneurosome phosphoproteome data from Brüning et al. [8] . In the original study, the authors studied the phosphorylation changes over time in synaptic terminals (otherwise known as synaptoneurosomes) from sleepdeprived mice and control mice. Here, we compared the overall differences in synaptoneurosome protein phosphorylation from mice while awake or asleep under control or sleep-deprived conditions. The original study processed the synaptoneurosome phosphopeptide data with the EasyPhos platform. The sleep/wake cycles were controlled as follows: mice were kept in a light:dark 12 hours cycle, synaptoneurosomes were taken every 4 hours (n = 4) in a single day for sleepdeprived mice and baseline mice, totaling to 48 samples [8] . For the reanalysis of this data, we downloaded the raw MS files from the PRIDE database with identifier PXD010697. Taking the original data preparation as a reference [8] , we preprocessed the raw data in a similar fashion using MaxQuant 1.6.17.0 [9] and Perseus 1.6.7.0 [40] (Supplementary Table 1 ). When inputting the preprocessed data into PhosPiR, 'Neither' was selected for normalization and imputation, as it is done outside of the pipeline.

The folder organization of the output files from Phos-PiR is shown schematically in Supplementary Figure 2 . The Group Information folder describes the group comparisons as setup by the user. Examples of group information can be found in Supplementary Tables 3  and 4 , respectively. In the Overview Figure folder , the overall data distribution is visualized in several ways including histograms, heat-maps, PCA and boxplots plots. An example is shown in Supplementary Figure 3 . The Statistical Analysis folder presents the statistical data including significance threshold for several tests as well as volcano plot output of significantly changing phosphopeptides or proteins, for each comparison and for every statistical test selected. Examples of statistical analysis output are shown in Figure 2A and Supplementary Table 5 . The Enrichment folder stores the protein-centric result on cell marker enrichment, GO enrichment and KEGG enrichment ( Figure 2B and Supplementary Table 6 ), and peptide-centric PTM-SEA enrichment result. PTM-SEA data are stored in the PhosphoSite enrichment subfolder, which includes converted. GCT input files and PTM-SEA analysis results, one per comparison, each in its own folder. Example rank plots of phosphorylation signatures can be found in Figure 2C . Kinase Analysis folder includes predicted significant kinases analyzed from KinSwingR, and resembled by swing scores, similar to z-scores for kinase activity change, and P-values, which indicate the significance of this change. Kinase activities are evaluated from the entire set rather than from the significant list. Examples of motif diagrams and kinase swing score output can be found in Figure 3 and Supplementary Table 7 , respectively. The Network folder provides interaction figures of kinases to substrates spinning off from kinase analysis (Figure 4) , also provides output from the STRING database network analysis and hub significance analysis. Examples are shown in Supplementary Figure 1 and Figure 5 . The Annotation folder includes important ID information, UniProt database information, human homolog information and sequence alignment for all proteins as well as phosphorylation sites from the dataset. An example of human homolog ID information and UniProt database information can be found in Supplementary Tables 8 and  9 , respectively.

The MS data used here [8] , incorporated intensity data for 13 634 phosphosites from which, 8386 remained after filtering. Among these, PhosPiR identified 61 known disease associated phosphorylation sites, and 256 known regulatory sites using the automatic detection annotation tools ( Supplementary Tables 10 and 11 ). The group comparisons (control versus sleep-deprived, and wake period versus sleep period with or without sleep deprivation), identified 367 significantly changing phosphorylation sites with fold change ≥2, and Rank product FDR of <0.05. These results can be seen from volcano plot and csv file output (Figure 2A, Supplementary Table 12) . Interestingly, the proteins with significantly altered phosphorylation between wake and sleep time were enriched for changes on the dopaminergic synapse pathway, as shown in Figure 2B . Thus, significant phosphopeptide changes were identified for voltage gated ion channels VGCC, VSSC and Cav2.1/2.2, and for signaling proteins PLC, PKC and CamKII ( Figure 2 ).

In the phosphosite-centric enrichment analysis, the signature set 'rapamycin' was 40% downregulated and the 'mTOR' signature set was 14% upregulated, in sleepdeprived synaptoneurosomes ( Figure 2C ), consistent with known negative regulation of mTOR by Rapamycin [41] . This demonstrates the utility of the PhosPiR pipeline to make functionally accurate predictions as the mTOR pathway is known to regulate sleepdeprivation induced responses [42, 43] . Moreover, as the PhosPiR identifies specific phosphorylation sites from these signature sets, as well as their regulatory function, where documented (Supplementary Table 13) , far more detailed mechanistic insight can be gained using PhosPiR's integrated approach than would be possible with a stand-alone phosphosite analysis. This is further supported by the PhosPiR kinase activity analysis, which uses the KinSwingR package to predict kinase activity changes (Figure 3, Supplementary Table 14) .

Examples of identified kinase substrates that undergo altered phosphosphorylation at synapses during wake versus sleep hours, or following sleep deprivation, are shown in the Circos plots in Figure 4A and B. In Figure 4A , neurofilament M (NEFM) which shows the greatest fold change in phosphorylation (−115×) can be seen to be regulated by SRC, ADRBK1 (GRK2), CSNK1D and PRKCD in synaptoneurosomes when comparing wake hours to sleep hours. Examination of the corresponding statistics file reveals that RPS6KA1 has the most increased activity among kinases based on motif phosphorylation, and PRKCZ shows the largest decrease in activity during wake hours in sleep-deprived mice based on observed phosphorylation changes on known kinase motifs ( Figure 4B ). This figure also highlights mTOR, this time as one of the kinases with the largest number of substrates that undergo altered phosphorylation in synaptoneurosomes from sleep-deprived mice. The precise substrates; adhesion G-protein coupled receptor L1 (LPHN1), cell cycle exit and neural differentiation protein 1 (CEND1), piccolo (PCLO), F-actin mono-oxygenase (MICAL3), spectrin beta chain (SPTBN1), MARCKS, GJA1 and MAP1B, and their altered phosphorylation sites, can be read directly from the plot ( Figure 4B) .

The hub analysis of protein phosphorylation in synaptoneurosomes during wake time versus sleep time has identified that the NMDA receptor subunit GRIN2B was a highly significant signaling hub ( Figure 5A ). This is consistent with several reports pointing to NMDAR in sleep regulation especially in the context of autoimmune encephalitis-induced sleep disturbance [44, 45] . Similarly, SHANK3 was highly connected to the changing phosphoproteins, consistent with its reported action in the control of circadian rhythm [46] . Synapsin I (SYN1), a neurotransmitter release regulatory protein, was also highly networked with the wake cycle phosphoproteins. SYN1 has previously been associated with synaptic changes following sleep deprivation [47] . Conversely, in sleep-deprived mice, there were fewer hubs overall consistent with the finding of Bruning et al., which showed that phosphorylation cycling was reduced upon sleep deprivation [8] . Nonetheless, synapsin I, neurofilament (NEFM) and MAPT showed increased connectivity to the regulated phosphoproteins ( Figure 5B ). MAPT phosphorylation has been shown to increase upon sleep-deprivation stress [48] . Thus, PhosPiR automated analysis identified known regulators of sleep/wake cycle and sleep-deprivation stress in synaptic terminal preparations from mouse brain. Moreover, PhosPiR provides site-specific and network information that can assist detailed parsing of mechanism.

PhosPiR also enables kinase-substrate predictions and links to kinase activity directional changes. Moreover, as all identified sites are matched to their human homolog with pairwise alignment (Supplementary Table 15 ). Analysis of pathological implications could be further confirmed from database search of aligned homolog sites. These are some of the highlights of PhosPiR. PhosPiR utilizes the KinSwingR tool to predict increases or decreases in kinase activities for defined group comparisons. PhosPiR integrates the KinSwingR tool which assesses local connectivity (swing) of kinase-substrate networks. Automated output from PhosphoPiR kinase analysis predicts regulated kinase activity based on identified substrate motifs. The final swing score is a normalized and weighted score of predicted kinase activity. Swing scores, positive and negative represent the direction of kinase activity change. Representative output tiffs are shown and accompanying csv file (ComparisonX_swingscore) is found in the Kinase analysis output folder.

PhosPiR is an automated pipeline that does not require any coding knowledge from its user. It integrates several new phosphoproteomic analysis tools such as PTM-SEA and KinSwingR into a single pipeline while it simultaneously translates phosphoproteomic data from model organisms to human in order to exploit a range of customized databases that facilitate identification of functionally relevant information.

Although all analysis steps are automatic, the pipeline retains flexibility through its setup options. The user is free to fully customize group comparisons. For example, in addition to the examples shown here, a time series Hubs are defined as proteins with interaction number > 1 SD from the mean. Hub significance is calculated from the number of interactions within the data set compared to 1000 equal sized background datasets randomly generated from the total data. The hub interaction count in the background dataset is shown as a boxplot, and interaction count (hubness) in the target network is indicated by a red star. FDR values calculated from the permutation test are indicated above the boxplots. Hubs from comparisons of sleep deprived versus control mice during wake hours, and wake hours versus sleep hours from sleep-deprived mice are shown in (A) and (B), respectively. analysis can also be included, by choosing the pairwise multigroup comparison option. All user options are presented with textboxes created using the 'svDialogs' package [49] creating a straightforward and seamless experience for the user. Although the pipeline can be applied also to nonphosphoproteomic data, several of the functions are specific to phosphoproteomics. For example, the ortholog alignment function, PTM-SEA enrichment analysis, kinase-substrate analysis and the kinase network figure generation all rely on phosphorylation-site information. Among the unique features is the ortholog alignment tool that provides a human reference site for every single phosphorylation-site from mouse or rat. An important benefit of PhosPiR is that it integrates several packages, such as PTM-SEA that utilize customized, curated libraries that contain up to date information. The integrated 'kinase analysis' function in PhosPiR not only predicts which kinases are linked to the input data, but also predicts activity changes based on the generated statistics. The 'kinase network' function clearly labels substrate phosphosite position and organizes them by protein, thereby providing a clear visual summary for significant kinase substrate changes.

Many important functions of the pipeline come from recently developed, powerful R-packages, which the pipeline unifies and provides important customizations. For example, most of the included analysis packages require strict input formats and generate diverse output formats. PhosPiR utilizes packages such as 'reshape2' [50] , 'vroom' [51] , 'openxlsx' [52] , 'textreadr' [53] , 'plyr' [54] and 'cmapR' [55] to transform data between analysis steps, so that input requirements are satisfied, and output results are unified. Output figures are expanded from the originals, modified with packages such as 'gplots' [56] , 'gridExtra' [33] and 'RColorBrewer' [57] to be informative at a glance. Many more packages are utilized and listed within Method section, we wanted to include all of them to offer their functionalities and customizations to a wider audience of nonbioinformaticians. PhosPiR also supports a wide range of organisms. With the exclusion of the disease ontology semantic enrichment analysis, kinase analysis and PTM-SEA which search only human data, whereas all other analysis steps support up to 18 organisms.

Our pipeline offers unique functionalities compared to even the most recent analysis packages such as PhosR [58] , which provides a kinase analyses toolkit for bioinformaticians working in R coding language. In contrast, the PhosPiR workf low does not require coding knowledge and can be performed by nonbioinformaticians. Furthermore, PhosPiR provides automated protein and site annotation from UniProt, Ensembl and PhosphoSitePlus. The annotation files provide information on functionality and associated pathologies. Scientific references for identified functions are included in the output. Another unique feature of the pipeline is the protein-centric network and hub analysis, which provides aligned sequence information on human homolog when for example the input data are from a model organism. Finally, the annotations accompanying the kinase enrichment function (using the PTM-SEA database) takes into account directionality of phosphorylation change when identifying pathology and regulatory signatures. These many exclusive features enable users to study their data from multiple angles and distinguishes PhosPiR from existing phosphoproteomic data analysis software.

In summary, PhosPiR is an automated pipeline that integrates a full range of analysis tools to maximize the discovery potential for phosphoproteomics datasets. PhosPiR is compatible with datasets from all MS spectral interpretation tools. In terms of compatibility, most of the analysis tools will support up to 18 different organisms for either phosphoproteomics or proteomics datasets. In terms of functionality, PhosPiR performs data preprocessing, normalization and imputation, figure overviews, statistical analysis, enrichment analysis, PTM-SEA, kinase analysis, network and hub analysis, while also carrying out annotation mining and homolog alignment. Altogether it produces over 100 result files and figures from an average dataset, without counting annotation files. In this paper, we use PhosPiR automated analysis to identify regulators of sleep/wake cycle and sleep-deprivation stress in synaptic terminal preparations from a mouse brain. This identifies known and novel regulators, thereby validating the pipeline's utility while at the same time contributing to discovery. We hope that PhosPiR will provide an opportunity for users with limited programming knowledge to experience great R packages for comprehensive functional prediction analysis with statistical support, from their phosphoproteomics data.

• PhosPiR is an automatic phosphoproteomics pipeline which does not require any programing knowledge from the users. • In a single run, PhosPiR provides peptide quality control, data overviewing with histogram, boxplot, heatmap and PCAs, data annotation utilizing UniProt and Ensembl database, differential expression analysis including four statistical test options and post hoc testing, phosphosite translation across species, four enrichment analyses for phosphoproteins, PTM-SEA (posttranslational modification set enrichment analysis) for phosphopeptide, kinase analysis, network analysis and hub analysis. • We included a variety of recently updated R package in the pipeline to offer their functionalities and customizations to a wider audience of nonbioinformaticians. • Our code and tutorial videos can be found at https://github.com/TCB-yehong/PhosPiR.

Supplementary data are available online at https://acade mic.oup.com/bib.

Example data are part of the study F. Brüning 

The PhosPiR software is available at the GitHub repository at the following site: https://github.com/TCB-yeho ng/PhosPiR. Any issues or questions during the run can be submitted to the 'Issue' tab on our GitHub page. The tutorial videos can be accessed via these links: https:// youtu.be/c7n7yE0DMsA (short setup introduction video), https://youtu.be/n4EagNoxusI (long pipeline run demonstration and result file introduction video). 

Cellular regulation by protein phosphorylation

The origins of protein phosphorylation

Integration of conventional quantitative and phospho-proteomics reveals new elements in activated Jurkat T-cell receptor pathway maintenance

Multilayered proteomics reveals molecular switches dictating ligand-dependent EGFR trafficking

Phosphorylation is a central mechanism for circadian control of metabolism and physiology

Protein phosphorylation from the perspective of systems biology

Rapid and sitespecific deep phosphoproteome profiling by data-independent acquisition without the need for spectral libraries

Sleep-wake cycles drive daily dynamics of synaptic phosphorylation

MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification

Computational challenges in biomarker discovery from high-throughput proteomic data

Imputation of Label-Free Mass Spectrometry Peptides

PaDuA: a python library for high-throughput (Phospho)proteomics data analysis

Elegant Graphics for Data Analysis

pheatmap: Pretty Heatmaps

fingerprint: Functions to Operate on Binary Fingerprint Data

3D Visualization Using OpenGL

AnRPackage for multivariate analysis

Extract and Visualize the Results of Multivariate Data Analyses

Plotting Multi-Dimensional Data

Advanced Graphics and Image-Processing in R

Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt

Biostrings: Efficient manipulation of biological strings

Software for computing and annotating genomic ranges

Protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences

UniprotR: retrieving and visualizing protein sequence and functional information from Universal Protein Resource (UniProt knowledgebase)

ROTS: an R package for reproducibility-optimized statistical testing

RankProd 2.0: a refactored Bioconductor package for detecting differentially expressed features in molecular profiling datasets

Selzer with help from Sundar Dorai-Raj. multcompView: Visualizations of Paired Comparisons

Least-squares means: the R package lsmeans

Linear and Nonlinear Mixed Effects Models

Automatically Position Non-Overlapping Text Labels with 'ggplot2', 2021. R package version 0.9

Miscellaneous Functions for Grid Graphics

clusterProfiler: an R package for comparing biological themes among gene clusters

A curated resource for phosphosite-specific signature analysis

KinSwingR: Network-Based Kinase Activity Prediction

PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse

Circlize implements and enhances circular visualization in R

STRING v11: proteinprotein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets

Perseus: a bioinformatics platform for integrative analysis of proteomics data in cancer research

Mammalian target of rapamycin inhibition

Moving to the rhythm with clock (circadian) genes, autophagy, mTOR, and SIRT1 in degenerative disease and cancer

Cellular clocks: coupled circadian and cell division cycles

NMDAR activation regulates the daily rhythms of sleep and mood

Sleep disorders in autoimmune encephalitis

Shank3 modulates sleep and expression of circadian transcription factors

Widespread changes in synaptic markers as a function of sleep and wakefulness in Drosophila

Sleep deprivation affects tau phosphorylation in human cerebrospinal fluid

Reshaping data with the reshape package

Read and Write Rectangular Text Data Quickly

openxlsx: Read, Write and Edit xlsx Files

Read Text Documents into R

The split-apply-combine strategy for data analysis

CMap Tools in R

gplots: Various R Programming Tools for Plotting Data

ColorBrewer Palettes

PhosR: A Set of Methods and Tools for Comprehensive Analysis of Phosphoproteomics Data