key: cord-0428310-q6dxf2es authors: McCool, Elijah N.; Xu, Tian; Chen, Wenrong; Beller, Nicole C.; Nolan, Scott M.; Hummon, Amanda B.; Liu, Xiaowen; Sun, Liangliang title: Qualitative and quantitative top-down proteomics of human colorectal cancer cell lines identified 23000 proteoforms and revealed drastic proteoform-level differences between metastatic and non-metastatic cancer cells date: 2021-10-28 journal: bioRxiv DOI: 10.1101/2021.10.27.466093 sha: b097804c598ff642b078436a3f235b33d3077b69 doc_id: 428310 cord_uid: q6dxf2es Understanding cancer metastasis at the proteoform level is crucial for discovering new protein biomarkers for cancer diagnosis and drug development. Proteins are the primary effectors of function in biology and proteoforms from the same gene can have drastically different biological functions. Here, we present the first qualitative and quantitative top-down proteomics (TDP) study of a pair of isogenic human metastatic and non-metastatic colorectal cancer (CRC) cell lines (SW480 and SW620). This study pursues a global view of human CRC proteome before and after metastasis in a proteoform specific manner. We identified 23,319 proteoforms of 2,297 genes from the CRC cell lines using capillary zone electrophoresis-tandem mass spectrometry (CZE-MS/MS), representing nearly one order of magnitude improvement in the number of proteoform identifications from human cell lines compared to literature data. We identified 111 proteoforms containing single amino acid variants (SAAVs) using a proteogenomic approach and revealed drastic differences between the metastatic and non-metastatic cell lines regarding SAAVs profiles. Quantitative TDP analysis unveiled statistically significant differences in proteoform abundance between the SW480 and SW620 cell lines on a proteome scale for the first time. 
Ingenuity Pathway Analysis (IPA) disclosed that many differentially expressed genes at the proteoform level had diversified functions and were closely related to cancer. Our study represents a milestone in TDP towards the definition of the human proteome in a proteoform-specific manner, which will transform basic and translational biomedical research.

Colorectal cancer (CRC) is the third most common cancer worldwide and has a high mortality rate even with recent improvements in therapies. [1, 2] CRC metastasis is the main cause of CRC-related death. New insights into the molecular mechanisms of CRC metastasis will undoubtedly be beneficial for developing more effective drugs. [3] [4] [5] Extensive studies have been completed with the goal of understanding CRC metastasis at the transcriptome level, generating tremendous information about the landscape of mRNA across different stages of CRC. [6, 7] However, nucleic-acid-based measurements do not correlate with the abundance of proteins, which are the primary effectors of function in biology. [8] Quantitative bottom-up proteomics (BUP) studies of metastatic and nonmetastatic CRC cell lines have been completed to discover new protein regulators involved in CRC metastasis. [4, 9, 10] Unfortunately, those BUP studies provided limited information on proteoforms, which represent all possible protein molecules derived from the same gene resulting from genetic variations, RNA alternative splicing, and protein post-translational modifications (PTMs). [11] It has been well demonstrated that proteoforms from the same gene can have drastically different biological functions. [12] Mass spectrometry (MS)-based top-down proteomics (TDP) directly measures intact proteoforms and delineates proteomes in a proteoform-specific manner. [13] TDP is invaluable for pursuing a better understanding of molecular mechanisms of cancers and discovering new proteoform biomarkers for more reliable diagnosis and drug development. [14]

Here, we performed the first deep TDP study of metastatic (SW620) and nonmetastatic (SW480) human CRC cell lines, aiming to produce a comprehensive proteoform-level view of the two isogenic CRC cell lines and discover novel proteoform biomarkers of CRC metastasis. We fractionated SW480 and SW620 cell lysates using one-dimensional liquid chromatography [1D-LC: size exclusion chromatography (SEC) or reversed-phase LC (RPLC)] and two-dimensional LC (2D-LC, SEC-RPLC), followed by capillary zone electrophoresis (CZE)-tandem MS (MS/MS) analyses of all the LC fractions from both cell lines for proteoform identification (ID) and label-free quantification (LFQ), Figure 1. The TopPIC (version 1.4.0) software was used for data analysis. [15] The experimental details are described in Supporting Information I.

One long-term goal of TDP is to characterize all the millions of proteoforms in the human body. [16, 17] During the last decade, owing to improvements in proteoform sample preparation, LC and CZE separation, MS, and MS/MS, about 3,000 proteoforms corresponding to roughly 1,000 genes can be identified in one TDP study from human cell lines using LC-MS/MS-based platforms, [18] [19] [20] and up to 6,000 proteoform IDs corresponding to 850 genes have been reported from an E. coli sample using a CZE-MS/MS-based workflow. [21] In this study, we collected 410 MS raw files and identified in total 23,319 proteoforms derived from 2,297 genes from the SW480 and SW620 human CRC cell lines with a 1% proteoform-level false discovery rate (FDR), representing nearly one order of magnitude improvement in the number of proteoform IDs from human cell lines, Figure 2A. Figure 2B shows a learning curve of the number of proteoform IDs from complex proteomes using TDP by comparing this study with previous works. [18] [19] [20] [21] [22] The data clearly demonstrate the power of our CZE-MS/MS-based TDP workflow for comprehensive characterization of proteoforms in complex proteome samples.
We attribute the drastic improvement of proteoform IDs to the high separation efficiency of CZE for proteoforms, [23] high sensitivity of CZE-MS for proteoform detection, [23] [24] [25] and high orthogonality of LC and CZE for biomolecule separations. [21, 26] The features of CZE-MS/MS for TDP have been systematically reviewed recently. [27, 28] The list of identified proteoforms is shown in Supporting Information II.

As shown in Figures 2C and 2D, the proteoforms were identified with high confidence. The average number of matched fragment ions per proteoform is nearly 20 and the average E-value of the identified proteoforms is about 1E-10. We identified 2,754 proteoforms with masses in a range of 10-26 kDa, and the majority of identified proteoforms had masses smaller than 10 kDa, Figure 2E. The intensity of identified proteoforms spanned seven orders of magnitude, Figure 2F, indicating the wide concentration dynamic range of proteoforms in the human cell samples.

PTMs modulate the biological functions of proteins. For example, protein N-terminal acetylation influences the stability, folding, binding, and subcellular targeting of proteins. [29] Protein phosphorylation is well known for regulating cell signaling, gene expression, and differentiation. [30] Protein methylation plays important roles in modulating transcription. [31] This large-scale TDP study identified 4,872 proteoforms with N-terminal acetylation (+42 Da mass shift), 319 proteoforms with phosphorylation [+80 Da (single phosphorylation) or +160 Da (double phosphorylation) mass shift], 321 proteoforms with methylation (+14 Da mass shift), and 241 proteoforms with oxidation (+16 Da mass shift), Figure 3A. TDP is powerful for the characterization of combinations of various PTMs on each proteoform. Here we identified 54 proteoforms with two phosphorylation sites and 90 proteoforms with both acetylation and phosphorylation.
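The nominal mass shifts quoted above can be used to illustrate how an observed shift maps to one PTM or to a combination of PTMs. The following is a minimal sketch only: the function name, the 0.5 Da tolerance, and the limit of two PTMs per shift are assumptions made for illustration, and it does not reproduce how TopPIC assigns modifications.

```python
from itertools import combinations_with_replacement

# Nominal mass shifts (Da) as quoted in the text.
NOMINAL_SHIFTS_DA = {
    "acetylation": 42,
    "phosphorylation": 80,
    "methylation": 14,
    "oxidation": 16,
}


def explain_mass_shift(observed_da, max_ptms=2, tol_da=0.5):
    """Return every combination of up to `max_ptms` PTMs whose summed
    nominal shift lies within `tol_da` of the observed shift.

    `max_ptms` and `tol_da` are illustrative assumptions.
    """
    matches = []
    for n in range(1, max_ptms + 1):
        for combo in combinations_with_replacement(sorted(NOMINAL_SHIFTS_DA), n):
            total = sum(NOMINAL_SHIFTS_DA[name] for name in combo)
            if abs(total - observed_da) <= tol_da:
                matches.append(combo)
    return matches


print(explain_mass_shift(42))   # single acetylation
print(explain_mass_shift(160))  # double phosphorylation
print(explain_mass_shift(122))  # acetylation plus phosphorylation
```

With nominal masses and at most two PTMs, none of the sums collide, so each shift has a single explanation. Real data would need accurate monoisotopic masses and a ppm-level tolerance to separate near-isobaric modifications.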
Figure 3B shows the sequences and fragmentation patterns of phosphorylated and unphosphorylated proteoforms of AKAP8L (A-kinase anchor protein 8-like). AKAP8L is associated with an unfavorable prognosis in CRC based on the Human Protein Atlas (https://www.proteinatlas.org/ENSG00000011243-AKAP8L). Both proteoforms of AKAP8L were identified with reasonably high confidence and the phosphorylation site was localized to S601. The S601 phosphorylation on AKAP8L has only been reported in a liver phosphoproteome study using the bottom-up strategy [32] and has not been reported in the two CRC cell lines studied here according to PhosphoSitePlus (version 6.5.9.3). [33] More studies are needed to reveal the function of AKAP8L phosphorylated at S601 in CRC.

We noted that both proteoforms of AKAP8L in Figure 3B were truncated at the N and C termini. In fact, we identified many truncated proteoforms in this study. Some of the truncations could be due to enzymatic processing in the cells. To test this hypothesis, we utilized the TopFINDer tool to seek clues of enzymatic cleavages by analyzing the truncated proteoforms. [34] We determined protein cleavage activities of six enzymes with p-values below 0.05, Figure 3C, including GRAA (Granzyme A, cleaves after K or R), [35, 36] IDE (Insulin-degrading enzyme, preferentially cleaves at hydrophobic and basic residues), [37] MAP12 (Methionine aminopeptidase 1D, mitochondrial, removes the N-terminal M from proteins), MAP2 (Methionine aminopeptidase 2, removes the N-terminal M from proteins), MEP1A (Meprin A subunit alpha, hydrolyzes proteins preferentially at hydrophobic residues), and MEP1B (Meprin A subunit beta, has a preference for acidic amino acids after the cleavage site). [38] Additionally, manual examination of the identified proteoforms confirmed the presence of some proteoform sequences complementary to the cleaved sequences, which further improved the confidence of the enzymatic activity results.
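The cleavage preferences listed above suggest a simple way to label an N-terminal truncation from the parent sequence and the start position of the proteoform. The sketch below is illustrative and is not the TopFINDer algorithm: the function name, the category labels, and the toy sequence are hypothetical, and it assumes that the residue of interest is the one immediately preceding the new N-terminus.

```python
def classify_n_terminal_truncation(parent_seq, start):
    """Label the N-terminal truncation of a proteoform.

    parent_seq: full-length protein sequence (one-letter codes).
    start: 1-based position in parent_seq of the proteoform's first residue.
    """
    if start < 1 or start > len(parent_seq):
        raise ValueError("start must lie within the parent sequence")
    if start == 1:
        return "intact N-terminus"
    preceding = parent_seq[start - 2]  # residue just before the new N-terminus
    if start == 2 and preceding == "M":
        return "initiator Met removal"
    if preceding in "KR":
        return "cleaved after K/R"
    return "other"


toy_parent = "MAEKLTRGSD"  # hypothetical sequence, not a real protein
for start in (1, 2, 4, 5, 8):
    print(start, classify_n_terminal_truncation(toy_parent, start))
```

A label of "cleaved after K/R" is compatible with several proteases, so it only narrows the candidates. Attributing a cleavage to a specific enzyme requires the kind of statistical enrichment analysis described in the text.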
The MAP12 protein is overexpressed in CRC cells and tumors compared to normal samples and may play important roles in CRC tumorigenesis. [39] We also analyzed over 6,000 N-terminally truncated proteoforms regarding the amino acids surrounding the truncation sites, Figure 3D. Basic amino acid residues (K and R) are highly enriched at the truncation sites (position 0), which may be due to the GRAA and IDE activities. Trypsinogen has been found in various CRC cell lines and tumors, and the proteoform truncations at the K and R positions may also be attributable to the activity of human trypsin in CRC cells. [40, 41] It has been demonstrated that trypsinogen has higher abundance in metastatic CRC cell lines (e.g., SW620) than in non-metastatic CRC cell lines (e.g., SW480), suggesting potential roles of trypsinogen in CRC invasion and metastasis. [40] We further analyzed the N-terminally truncated proteoforms identified from SW480 and SW620 cell lines in the SEC-CZE-MS/MS data produced in this study. We discovered that 71±0.6% (n=3) of those truncated proteoforms in SW620 cells were cleaved at K or R, and that this percentage was significantly higher (t-test p-value 0.03) than that in SW480 cells (67±2%, n=3).

One important value of TDP is its capability for delineation of various proteoforms from the same gene (proteoform family). [42] Figure 3E shows one example, the Calmodulin-1 (CALM1) proteoform family. Calmodulin-1 modulates many enzymes (kinases and phosphatases), ion channels, and many other proteins through calcium binding. We identified 75 proteoforms of CALM1. Nearly 70% of those proteoforms start at position 2 with the N-terminal methionine removed. Various truncated proteoforms, for example, with starting positions around 40, 60, 80, and 120, were identified at a much lower frequency. Many of those truncated proteoforms were cleaved at basic amino acid residues (K or R).
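The comparison of K/R cleavage percentages reported above (71±0.6% in SW620 vs. 67±2% in SW480, n=3 each) can be checked from the summary statistics alone. The sketch below assumes that the ± values are standard deviations and that an equal-variance two-sample Student's t-test was used; the text says only "t-test", so both points are assumptions. The function names are illustrative, and the p-value is obtained by numerically integrating the t-distribution density with Simpson's rule rather than by calling a statistics library.

```python
import math


def t_pdf(x, df):
    """Probability density of Student's t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)


def student_t_from_summary(m1, s1, n1, m2, s2, n2):
    """Equal-variance two-sample t-test from means, SDs, and sample sizes."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (m1 - m2) / se
    return t, df, two_sided_p(t, df)


# SW620: 71 +/- 0.6 (n=3); SW480: 67 +/- 2 (n=3); +/- assumed to be SD.
t, df, p = student_t_from_summary(71, 0.6, 3, 67, 2, 3)
print(f"t = {t:.2f}, df = {df}, p = {p:.3f}")
```

Under these assumptions the p-value comes out at roughly 0.03, in line with the reported value. A Welch (unequal-variance) test or a different interpretation of ± would give a different number.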
The number of proteoform spectrum matches (PrSMs) can be used to roughly estimate the relative abundance of proteoforms. [21, 43] For the CALM1 proteoforms starting containing single AAVs (SAAVs) from cancer cells. [44, 45] The Kelleher group reported the identification of 10 proteoforms containing SAAVs from breast tumor xenografts in one TDP study. [46] Here we identified 111 proteoforms containing SAAVs of 82 genes from the SW480 and SW620 cell lines using a proteogenomic approach at a 5% proteoform-level FDR, representing one order of magnitude improvement in the number of identified proteoforms containing SAAVs compared to previous studies, Figure 3F. The transcriptomic variants based on the available RNA-Seq data were incorporated into the protein database for the identification of proteoforms containing SAAVs using TopPG, a recently developed bioinformatics tool. [47] We also manually inspected the MS/MS spectra of proteoforms containing the SAAV sites to ensure high-confidence identifications. We identified more proteoforms containing SAAVs from the metastatic SW620 cells than from the nonmetastatic SW480 cells (73 vs. 60). Only 20% of the 111 proteoforms were identified from both cell lines, suggesting drastic differences between the two cell lines regarding their SAAV profiles.

Figure 3G shows the sequences and fragmentation patterns of two examples of proteoforms containing SAAVs. TP53 is an important tumor suppressor and has been closely related to CRC development. We identified one proteoform containing the AAV at position 72 (P→R) due to the codon 72 polymorphism. Studies have shown functional differences between the P72 and R72 proteoforms of TP53. [48] [49] [50] For example, the R72 proteoform induces apoptosis markedly better than the P72 proteoform. [48, 49] In another study, the results indicated that the expression of the P72 proteoform increased CRC metastasis, and that the R72 proteoform does not exist in the nonmetastatic CRC cell line (SW480) based on the nucleic-acid data. [50] Interestingly, we identified the R72 proteoform of TP53 only in the SW620 cell line, not in the SW480 cell line, from the top-down MS data. MSH6 is one of the DNA mismatch repair genes and its mutations play a crucial role in Lynch syndrome, which is an inherited form of CRC. We identified one MSH6 proteoform containing the SAAV due to the polymorphism at position 39 (G→E). The G39 has been associated with an increased risk of CRC according to the nucleic-acid data. [51] We identified G39 proteoforms of MSH6 in both SW480 and SW620 cells, but identified the E39 proteoform only in the SW480 cells, not in the SW620 cells.

Comparing the proteoforms with higher expression in SW480 cells with those with higher expression in SW620 cells revealed 36 overlapping genes, suggesting that for those 36 genes, different proteoforms of the same gene had completely different expression patterns in the two cell lines. Figure 4B shows two differentially expressed proteoforms of one of those 36 genes, DAP (Death-associated protein 1). It has been reported that DAP modulates cell death and correlates with the clinical outcome of CRC patients. [52] Interestingly, we revealed that one phosphorylated proteoform of DAP (~7,607 Da, phosphorylation site S51 or T56) had higher abundance in SW480 cells and another phosphorylated proteoform (~4,605 Da, phosphorylation site S51) showed higher expression in SW620 cells. Both S51 and T56 are known to be phosphorylated according to PhosphoSitePlus, with S51 being the most common phosphorylation site. Those two proteoforms could have the same or closely localized phosphorylation sites, and they were cleaved at the basic amino acid residue arginine (R) at both the N and C termini.
These data highlight the value of TDP for quantitative characterization of proteins in a proteoform-specific manner. We noted that the differentially expressed proteoforms in this study include phosphorylated proteoforms of several important genes related to CRC, i.e., RALY, [53] NPM1, [54] DAP, [52] and HDGF, [55] Table S1. The functions of the phosphorylated forms of those four proteins in modulating CRC development are still unclear. However, the differential expression of those phosphorylated proteoforms in the metastatic and non-metastatic CRC cells suggests their potential roles in regulating CRC metastasis.

We then performed IPA analyses of the genes of those differentially expressed proteoforms between SW480 and SW620 cells. The top five pathways in which those genes are involved are EIF2 signaling, regulation of eIF4 and p70S6K signaling, mTOR signaling, coronavirus pathogenesis pathway, and sirtuin signaling pathway, Figure 4C. Those genes are heavily involved in cancer-related diseases, for example, tumorigenesis of tissue and metastasis, Figure 4D. Five of those proteins (EIF4E, EPCAM, FKBP1A, GAA, and HSP90AB1) are drug targets. IPA network analyses revealed that 26 proteins (highlighted in purple) whose proteoforms showed higher abundance in SW480 compared to SW620 were involved in a cancer-related network (score 51), Figure 4E. Those proteins belong to several families, including enzyme (diamond shape, e.g., PARK7 and FKBP4), transcription regulator (oval shape, e.g., FUBP1), translation regulator (hexagon shape, e.g., CIRBP and EEF1A1), transporter (trapezium shape, e.g., SLC12A2 and LASP1), and other (circle shape, e.g., EPCAM and JPT1). Most of those proteins have direct (solid line) and indirect (dotted line) interactions with one another. We also carried out network analysis for the proteins whose proteoforms had higher expression in SW620 cells, and observed high scores for cancer-related networks.
Pink dots and blue dots represent proteoforms having statistically significantly higher abundance in SW480 and in SW620, respectively. The Perseus software was used for performing the t-test and generating the volcano plot with the following settings: S0 = 1 and FDR = 0.05. [56] (B) Sequences and fragmentation patterns of two phosphorylated proteoforms of the gene DAP. One has higher abundance in SW480 cells and the other has higher expression in SW620 cells. (C) The top 5 Ingenuity canonical pathways for the differentially expressed genes at the proteoform level according to the IPA analysis. (D) Some of the cancer-related diseases in which the differentially expressed genes are involved according to the IPA analysis. Proteoforms with higher abundance in SW480 cells (E) or higher abundance in SW620 cells (F) correspond to genes that are involved in cancer-related networks with high scores. Those genes are highlighted in purple. The diamond, oval, hexagon, trapezium, square, and circle shapes represent enzyme, transcription regulator, translation regulator, transporter, growth factor, and other, respectively. The solid and dotted lines represent direct and indirect interactions, respectively.
[1] The inflammatory pathogenesis of colorectal cancer
[2] Colorectal Cancer Cells Enter a Diapause-like DTP State to Survive Chemotherapy
[3] Molecular Basis of Colorectal Cancer
[4] Phosphoproteomics of colon cancer metastasis: comparative mass spectrometric analysis of the isogenic primary and metastatic cell lines SW480 and SW620
[5] Clinical potential of mass spectrometry-based proteogenomics
[6] Transcriptome analysis of human colorectal cancer biopsies reveals extensive expression correlations among genes related to cell proliferation, lipid metabolism, immune response and collagen catabolism
[7] Colorectal cancer stages transcriptome analysis
[8] Proteogenomic characterization of human colon and rectal cancer
[9] A quantitative proteomic approach of the different stages of colorectal cancer establishes OLFM4 as a new nonmetastatic tumor marker
[10] Identification of Key Players for Colorectal Cancer Metastasis by iTRAQ Quantitative Proteomics Profiling of Isogenic SW480 and SW620 Cell Lines
[11] Proteoform: a single term describing protein complexity
[12] Proteoforms as the next proteomics currency
[13] Progress in Top-Down Proteomics and the Analysis of Proteoforms
[14] Precise characterization of KRAS4b proteoforms in human colorectal cells and tumors reveals mutation/modification cross-talk
[15] TopPIC: a software tool for top-down mass spectrometry-based proteoform identification and characterization
[16] The Human Proteoform Project: A Plan to Define the Human Proteome
[17] How many human proteoforms are there?
[18] Mapping intact protein isoforms in discovery mode using top-down proteomics
[19] Large-scale top-down proteomics of the human proteome: membrane proteins, mitochondria, and senescence
[20] Identification and Characterization of Human Proteoforms by Top-Down LC-21 Tesla FT-ICR Mass Spectrometry
[21] Deep Top-Down Proteomics Using Capillary Zone Electrophoresis-Tandem Mass Spectrometry: Identification of 5700 Proteoforms from the Escherichia coli Proteome
[22] Top-Down Proteomics of Large Proteins up to 223 kDa Enabled by Serial Size Exclusion Chromatography Strategy
[23] Large-Scale Qualitative and Quantitative Top-Down Proteomics Using Capillary Zone Electrophoresis-Electrospray Ionization-Tandem Mass Spectrometry with Nanograms of Proteome Samples
[24] Comparing nanoflow reversed-phase liquid chromatography-tandem mass spectrometry and capillary zone electrophoresis-tandem mass spectrometry for top-down proteomics
[25] In-line separation by capillary electrophoresis prior to analysis by top-down mass spectrometry enables sensitive characterization of protein complexes
[26] Improved Nanoflow RPLC-CZE-MS/MS System with High Peak Capacity and Sensitivity for Nanogram Bottom-up Proteomics
[27] Recent advances (2019-2021) of capillary electrophoresis-mass spectrometry for multilevel proteomics
[28] Recent trends of capillary electrophoresis-mass spectrometry in proteomics research
[29] Spotlight on protein N-terminal acetylation
[30] Tackling the phosphoproteome: tools and strategies
[31] Role of protein methylation in regulation of transcription
[32] An enzyme assisted RP-RPLC approach for in-depth analysis of human liver phosphoproteome
[33] 2014: mutations, PTMs and recalibrations
[34] Proteome TopFIND 3.0 with TopFINDer and PathFINDer: database and analysis tools for the association of protein termini to pre- and post-translational events
[35] Granzyme A from cytotoxic lymphocytes cleaves GSDMB to trigger pyroptosis in target cells
[36] Crystal structure of the apoptosis-inducing human granzyme A dimer
[37] Analysis of the subsite specificity of rat insulysin using fluorogenic peptide substrates
[38] Proteomic analyses reveal an acidic prime side specificity for the astacin metalloprotease family reflected by physiological substrates
[39] MAP1D, a novel methionine aminopeptidase family member is overexpressed in colon cancer
[40] Human trypsinogen in colorectal cancer
[41] Trypsinogen expression in colorectal cancers
[42] Constructing Human Proteoform Families Using Intact-Mass and Top-Down Proteomics with a Multi-Protease Global Post-Translational Modification Discovery Database
[43] Evaluation of Spectral Counting for Relative Quantitation of Proteoforms in Top-Down Proteomics
[44] Comprehensive Detection of Single Amino Acid Variants and Evaluation of Their Deleterious Potential in a PANC-1 Cell Line
[45] Large-scale quantification of single amino-acid variations by a variation-associated database search strategy
[46] Integrated Bottom-Up and Top-Down Proteomics of Patient-Derived Breast Tumor Xenografts
[47] Proteoform Identification by Combining RNA-Seq and Top-Down Mass Spectrometry
[48] The codon 72 polymorphic variants of p53 have markedly different apoptotic potential
[49] Differential levels of transcription of p53-regulated genes by the arginine/proline polymorphism: p53 with arginine at codon 72 favors apoptosis
[50] Functional consequence of the p53 codon 72 polymorphism in colorectal cancer
[51] Polymorphism of Gly39Glu (c.116G>A) hMSH6 is associated with sporadic colorectal cancer development in the Polish population: Preliminary results
[52] Death associated protein 1 is correlated with the clinical outcome of patients with colorectal cancer and has a role in the regulation of cell death
[53] RNA-binding protein RALY reprogrammes mitochondrial metabolism via mediating miRNA processing in colorectal cancer
[54] Nucleophosmin and cancer
[55] MiR-610 inhibits cell proliferation and invasion in colorectal cancer by repressing hepatoma-derived growth factor
[56] The Perseus computational platform for comprehensive analysis of (prote)omics data

The work was funded by the National Cancer Institute (NCI) through the grant R01CA247863 (Sun, Hummon, and Liu). We also thank the support from the National Institute of General Medical Sciences (NIGMS) through grants R01GM125991 (Sun and Liu) and

The authors declare no competing interests.