key: cord-0923534-66v367yz
authors: Moszczuk, Barbara; Krata, Natalia; Rudnicki, Witold; Foroncewicz, Bartosz; Cysewski, Dominik; Pączek, Leszek; Kaleta, Beata; Mucha, Krzysztof
title: Osteopontin—A Potential Biomarker for IgA Nephropathy: Machine Learning Application
date: 2022-03-22
journal: Biomedicines
DOI: 10.3390/biomedicines10040734
sha: 6c447f4dda810ebda79fa77f761c8891517ddc6d
doc_id: 923534
cord_uid: 66v367yz

Many potential biomarkers in nephrology have been studied, but few are currently used in clinical practice. One is osteopontin (OPN). We compared urinary OPN concentrations in 80 participants: 67 patients with various biopsy-proven glomerulopathies (GNs), comprising immunoglobulin A nephropathy (IgAN, 29), membranous nephropathy (MN, 20), and lupus nephritis (LN, 18), and 13 with no GN. Follow-up included 48 participants. Machine learning was used to correlate OPN with other factors to classify patients by GN type. The resulting algorithm had an accuracy of 87% in differentiating IgAN from other GNs using urinary OPN levels only. A lesser effect for discriminating MN and LN was observed. However, the lower number of patients and the phenotypic heterogeneity of MN and LN might have affected those results. OPN was significantly higher in IgAN at baseline than in other GNs and therefore might be useful for identifying patients with IgAN. That observation did not apply either to patients with IgAN at follow-up or to patients with other GNs. OPN seems to be a valuable biomarker and should be validated in future studies. Machine learning is a powerful tool that, compared with traditional statistical methods, can also be applied to smaller datasets.

According to the U.S. Centers for Disease Control and Prevention, the number of people affected with chronic kidney disease (CKD) in the United States has reached 37 million, or 15% of the adult population [1]. In 2018, the leading causes of end-stage kidney disease were diabetes (39%), hypertension (26%), and glomerulonephritis (15%). Those conditions can present a similar clinical picture or can overlap, necessitating the use of invasive diagnostic methods such as kidney biopsy. The need to define and implement noninvasive diagnostic markers is particularly pressing in the immune-related glomerulonephropathies (GNs), whose treatment is different from that for diabetes- or hypertension-related CKD. Efforts to create noninvasive tests that will help diagnose and monitor kidney disease have included genomic, transcriptomic, and proteomic approaches to detect gene polymorphisms [2, 3], mRNA expression [4], and serum and urinary proteins [5, 6]. Unfortunately, new biomarkers are not used in everyday clinical practice, mostly because of insufficient diagnostic sensitivity and specificity as demonstrated in clinical trials. Thus, the search for clinically useful biomarkers in CKD continues.

Osteopontin (OPN) is a multifunctional, extracellular phosphoprotein that is expressed in various cells and tissues, including fibroblasts, osteoblasts, macrophages, endothelial cells, adipocytes, Kupffer cells, and dendritic cells. Studies have demonstrated that OPN plays a role in the development of inflammation, wound healing, cancer metastases, diabetes, nephrolithiasis, and modulation of osteoclast function (reviewed in [7]).

The role of OPN in glomerular diseases is not clearly defined. OPN gene polymorphisms are associated with the development of diabetic nephropathy in type 2 diabetes [8], urinary OPN (uOPN) excretion in patients with IgA nephropathy (IgAN) [9], and acute renal allograft rejection [10]. OPN mRNA expression in tissue is increased in areas of tubular damage [11] and in patients with renal calculi [12]. Interestingly, serum OPN has been confirmed to be a biomarker correlating with renal involvement in patients with systemic lupus erythematosus [13] and to be independently associated with the development of microalbuminuria in patients with type 1 diabetes mellitus [14]. Finally, urinary OPN is known to rise in active lupus nephritis (LN) [15]. However, OPN as a factor for discriminating between various kidney diseases has not yet been fully explored. In previous research, our team focused on the link between OPN gene polymorphisms and excretion of uOPN in patients with IgAN [9]. In the present study, we compared uOPN in various immune-related glomerulopathies. IgAN is the most common primary GN, with an incidence of 2-5 adults per 100,000 [16]. Primary membranous nephropathy (MN) is a rare disease (ORPHA number 97560), but an important cause of proteinuria. LN is a frequent secondary autoimmune GN with a variable histopathologic picture. We measured uOPN concentrations in patients with those GNs to assess the potential utility of uOPN as a biomarker. We also compared uOPN with concentrations of peroxiredoxins (PRDXs), previously studied markers of oxidative stress [5], and created a network of biologic pathways that involve OPN to better understand the role of OPN in cells.

Our aim in the present study was to compare uOPN concentrations in patients with various GNs and to use machine learning (ML) to correlate those concentrations with clinical factors and with PRDXs.

OPN at baseline was measured in 80 participants: 67 patients (IgAN, n = 29; LN, n = 18; MN, n = 20) and 13 healthy individuals, defined by the absence of any kidney disease or other chronic disease requiring treatment. Measurements from 48 participants were available at follow-up: 43 patients (IgAN, n = 18; LN, n = 11; MN, n = 14) and 5 healthy individuals. IgAN, LN, and MN were confirmed by renal biopsy. The healthy control group consisted of age- and sex-matched volunteers. Exclusion criteria were active infection, current pregnancy, history of malignancy, or prior organ transplantation. Written informed consent was given by all study participants. Tables 1 and 2 

Urine samples (second or third morning urine) were centrifuged (10 min at 2000 rpm) within 120 min from collection, aliquoted into 2 mL cryovials, and frozen at −80 °C until use. Laboratory tests such as serum creatinine, blood morphology, urinalysis, and urinary protein were performed using routine laboratory techniques. PRDX concentrations had been obtained during a previous study of the same patient sample [5]. The eGFR was calculated using the chronic kidney disease epidemiology collaboration (CKD-EPI) equation. Body mass index was calculated as body weight in kilograms divided by the square of height in meters (kg/m²).

OPN was measured with the Human Osteopontin (OPN) Quantikine ELISA Kit (R&D Systems, Minneapolis, MN, USA). Urine samples were diluted 20× with assay diluent according to the manufacturer's instructions. To each well in the 96-well microplate (precoated with a monoclonal antibody specific for human OPN), 100 µL of assay diluent was added; then, 50 µL each of standard and sample were pipetted into the wells in duplicate. The microplate was then incubated for 2 h at room temperature (22-25 °C), allowing the OPN in the sample to be bound by the immobilized antibody. After incubation, any unbound substances were washed away manually using a wash buffer provided by the manufacturer and according to the assay procedure, and 200 µL of an enzyme-linked polyclonal antibody specific for human OPN was added to the wells. The plate was again incubated for 2 h at room temperature (22-25 °C). After a wash to remove any unbound antibody-enzyme reagent, 200 µL of a substrate solution was added to the wells, where color developed in proportion to the amount of OPN bound in the initial step. The color development was stopped by the addition of the stop solution included with the assay, and the optical density was subsequently measured using a BioTek PowerWave XS microplate reader (Agilent, Santa Clara, CA, USA) at a wavelength of 450 nm. To determine the OPN concentration (ng/mL), the GraphPad Prism software application (version 9.0.1: GraphPad Software, San Diego, CA, USA) used the optical density with a standard curve (4-parameter logistic equation), including extrapolation. Each result was multiplied by 20 to obtain the actual urine OPN concentration.
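The conversion from optical density to urine concentration can be illustrated with a minimal Python sketch. It is not the GraphPad Prism procedure used in the study: the parameterization of the 4-parameter logistic curve is one common form, the function names are hypothetical, and the fit parameters in `PARAMS` are invented for illustration only. Only the 20× dilution factor comes from the text above.

```python
def four_pl(conc, a, b, c, d):
    """4-parameter logistic curve: optical density as a function of concentration.

    a = response at zero concentration, d = response at infinite concentration,
    c = inflection point, b = slope factor (one common parameterization).
    """
    return d + (a - d) / (1.0 + (conc / c) ** b)


def inverse_four_pl(od, a, b, c, d):
    """Solve the 4PL curve for concentration, given an optical density."""
    if od == d:
        raise ValueError("optical density equals the upper asymptote")
    ratio = (a - d) / (od - d) - 1.0
    if ratio <= 0:
        raise ValueError("optical density lies outside the range of the curve")
    return c * ratio ** (1.0 / b)


def urine_opn(od, params, dilution_factor=20):
    """Concentration read from the curve, corrected for sample dilution."""
    return inverse_four_pl(od, *params) * dilution_factor


# Hypothetical fit parameters (a, b, c, d); NOT values from the study.
PARAMS = (0.05, 1.2, 5.0, 2.5)

if __name__ == "__main__":
    od = four_pl(4.0, *PARAMS)       # simulate a well holding 4 ng/mL
    print(urine_opn(od, PARAMS))     # concentration in undiluted urine, ng/mL
```

In this sketch a diluted sample reading back as 4 ng/mL corresponds to 80 ng/mL in the original urine, which mirrors the final multiplication step described above.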

The statistical analysis was performed in the GraphPad Prism (version 9.0.1) and Statistica (version 13.1, StatSoft, Tulsa, OK, USA) software applications. Results are expressed as mean ± standard deviation, median ± interquartile range, or a percentage. All variables were examined by the Shapiro-Wilk test for normal distribution. Non-normally distributed variables were analyzed using nonparametric tests. Comparisons between demographic variables were tested using the Kruskal-Wallis test, and comparisons between the control and GN groups using the Mann-Whitney U-test. Correlations between pairs of parameters were examined using Spearman's correlation analysis. The differences between categorical variables were calculated with the chi-square test. The level of significance was set to p < 0.05.
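As an illustration of the rank-based correlation named above, the following Python sketch computes Spearman's coefficient as the Pearson correlation of average ranks. It is a didactic stand-in for the commercial software used in the study, it does not compute a p value, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks enter the calculation, the coefficient is insensitive to the skewed distributions that motivated the use of nonparametric tests.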

We analyzed the data set using supervised machine learning (ML) algorithms. ML allows rapid exploration of data without prior specification of detailed models and with minimal assumptions about the data. ML also automatically takes interactions between variables into account. All findings of the ML approach were verified by statistical analysis, and the two methods agreed very well, particularly when the signal was strong. We used the ML algorithm Random Forest [17] to build a model that used standard clinical indicators, together with OPN and PRDX levels, to predict each participant's classification: control, IgAN, MN, or LN. Given the available data collected, eight descriptors were available for all 80 patients: "Gndr" (gender), "BMI" (body mass index), "CR" (creatinine), "eGFR," "Hb" (hemoglobin), "PLT" (platelets), "WBC" (white blood cells), and "OPN". PRDX levels (1-5) were available for only 53 patients: 7 in the control class, 16 in the IgAN class, 12 in the LN class, and 16 in the MN class.

The analysis consisted of two steps. In the first step, the all-relevant-features selection algorithm Boruta [18] was used to find the descriptive variables that carried information about the class variable. Then a Random Forest classifier was built using only the variables not rejected by Boruta. We used the Random Forest [19] and Boruta [20] libraries in R [21]. Random Forest is a general-purpose ML algorithm for classification and nonparametric regression, widely used across multiple disciplines. It is an ensemble of decision trees. Each tree is built using a different data sample, and each split in a tree is built on a variable selected from a subset of all variables. The objects not used for the construction of a particular tree (the so-called out-of-bag objects) can be used for an unbiased estimate of the classification error and variable importance. In particular, the importance of a variable is established by measuring the decrease in the accuracy with which out-of-bag objects are classified when information about the variable under consideration is removed from the trees.
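The out-of-bag idea can be made concrete with a short Python sketch. It assumes that each tree is trained on a bootstrap sample drawn with replacement and of the same size as the data set, which is a common Random Forest configuration but is an assumption here; the study itself used the R libraries cited above, and the function names below are hypothetical.

```python
import random


def bootstrap_split(n_objects, rng):
    """Draw one bootstrap sample; return in-bag draws and out-of-bag indices."""
    in_bag = [rng.randrange(n_objects) for _ in range(n_objects)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_objects) if i not in drawn]
    return in_bag, out_of_bag


def mean_oob_fraction(n_objects, n_trees, seed=0):
    """Average share of objects left out of bag across many simulated trees."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        _, oob = bootstrap_split(n_objects, rng)
        total += len(oob)
    return total / (n_trees * n_objects)


if __name__ == "__main__":
    # With 80 participants, a bit more than one third of them are
    # typically out of bag for any given tree.
    print(mean_oob_fraction(80, 500))
```

Each tree can therefore be evaluated on objects it never saw during training, which is why the out-of-bag error does not require a separate test set.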

The Boruta algorithm belongs to the class of all-relevant-features selection algorithms. It is a wrapper around Random Forest. It works by extending the original set of variables by their randomized copies, so-called shadow variables. By design, the randomized copies carry no information about the decision variable. Boruta builds multiple Random Forest classifiers, each using a different set of shadow variables, and compares the importance of the original variables with the importance of the most important shadow variable (shadow max) from each set of shadow variables. The variables whose importance exceeds that of the most important shadow variable in a statistically significant way are deemed relevant. The variables that are statistically less important than the most important shadow variable are deemed irrelevant. The variables for which the test is inconclusive are called tentative. A full description of applied algorithms is available in Supplement, File S1.
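The shadow-variable mechanism can be sketched in a few lines of Python. This is a simplified illustration and not the R Boruta implementation used in the study: the importance scores are supplied by the caller rather than computed by a Random Forest, the statistical test on the hit counts is omitted, and the names `make_shadows`, `count_hits`, and the `shadow_` prefix are hypothetical.

```python
import random


def make_shadows(columns, rng):
    """Return a shuffled copy of every column (the shadow variables).

    Shuffling keeps each variable's distribution but breaks any link
    with the decision variable.
    """
    shadows = {}
    for name, values in columns.items():
        copy = list(values)
        rng.shuffle(copy)
        shadows["shadow_" + name] = copy
    return shadows


def count_hits(importance_runs, real_names):
    """Count, per real variable, the runs in which it beat the best shadow.

    importance_runs: list of dicts mapping variable name -> importance,
    one dict per Random Forest run (real and shadow variables together).
    """
    hits = {name: 0 for name in real_names}
    for run in importance_runs:
        shadow_max = max(v for k, v in run.items() if k.startswith("shadow_"))
        for name in real_names:
            if run[name] > shadow_max:
                hits[name] += 1
    return hits
```

A variable that beats the shadow maximum in nearly every run would be deemed relevant, one that almost never does would be rejected, and intermediate counts correspond to the tentative category described above.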

Because the algorithms used are randomized, the results may differ slightly between runs.

Our results are divided into three sections. First is the analysis performed with ML. An algorithm was introduced to a set of laboratory data and biopsy-proven diagnoses from half the samples, thus "teaching" the algorithm to form diagnostic pathways (decision trees based on yes/no commands). Those decision trees were then applied to a set of data without a known diagnosis to test their accuracy (Sections 3.1-3.3). Second is a comparison of OPN levels at the approximate time of diagnosis and after treatment in all tested groups, which checked for correlations with clinical factors (Section 3.4). Third is the creation, using the information previously obtained, of a network of biologic processes that includes OPN.

Based on the provided data, the algorithm "decides" which sample matches which GN. However, not every variable has equal significance. Figure 1 shows the variables that were selected by the algorithm as important in correctly placing a patient into a given GN class. Data are tested against the shadow values created by the Boruta algorithm to establish a variable's significance.

We compared the importance of the variables marked as significant by the algorithm in correctly classifying a patient to a GN class (Table 3). A higher value indicates a higher error in patient placement when the variable is removed from the dataset. Each entry corresponds to an average decrease in the accuracy of decision trees for objects of a particular class when information concerning a given variable is withdrawn from the classifier. The last column is the average value regardless of class. Rows are sorted in descending order based on mean importance. OPN is most responsible for correct patient placement in the IgAN class. Mean eGFR is the most important variable for correct classification of healthy participants, and Hb and WBC are the most important for LN. The quality of prediction is worst for the MN class, with no variable being relevant for that class. Table 4 shows the accuracy of the algorithm based on results from 10 runs of the classifier. The algorithm was not able to correctly classify the control participants (probably because too few samples were available), but the prediction was correct most of the time for patients with IgAN (class error 0.31).

Using the previously selected variables (OPN, WBC, Hb, eGFR, and CR), we tested our algorithm by finding patients with IgAN from among other non-IgAN samples. Table 5 strongly confirms the relevance of OPN for correctly placing a patient into the IgAN class (the highest number in the column). Each entry corresponds to an average decrease in the accuracy of decision trees for objects of a particular class when information about a given variable is withdrawn from the classifier. The last column is the average value regardless of class. Rows appear in descending order based on mean importance. Four other variables are relevant as well. Ten runs of the Random Forest classifier were performed. The results are nonrandom, as shown in Table 6. OPN had the strongest predictive value for IgAN, with a low error margin: the class error was 0.13, which means that the algorithm was correct in 87% of cases. OPN had no predictive value for LN or MN. (Hb and WBC did, but those results are not shown; data are available on request.) Figure 2 shows the results in pictorial form.
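The relationship between class error and accuracy quoted above can be shown with a minimal Python sketch. The confusion counts below are invented so that the IgAN class error comes out at 0.13; they are not the study's data, and `class_errors` is a hypothetical helper.

```python
def class_errors(confusion):
    """Per-class error: share of objects of a true class that were misclassified.

    confusion[true_label][predicted_label] = count
    """
    errors = {}
    for true_label, row in confusion.items():
        total = sum(row.values())
        correct = row.get(true_label, 0)
        errors[true_label] = 1.0 - correct / total
    return errors


# Hypothetical counts, chosen only to reproduce a class error of 0.13.
CONFUSION = {
    "IgAN": {"IgAN": 87, "non-IgAN": 13},
    "non-IgAN": {"IgAN": 20, "non-IgAN": 80},
}

if __name__ == "__main__":
    for label, err in class_errors(CONFUSION).items():
        print(label, "class error:", round(err, 2),
              "correct:", round(1.0 - err, 2))
```

A class error of 0.13 thus corresponds to 87% of the objects in that class being placed correctly, independent of how the other class performs.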

In previous research, our team studied PRDXs as potential biomarkers of oxidative stress in IgAN, MN, and LN [5]. We observed that the concentration of PRDXs 1-5 differed in patients with various GNs. For the present study, we added PRDXs to a whole-group analysis similar to the one described in Section 3.2. The previously used variables and PRDXs 1-5 were tested for their prediction strength in placing a patient into the correct class (Table 7). Only two PRDXs are shown because the others failed to achieve the required accuracy (importance was measured on the reduced dataset consisting of the 53 patients for whom the additional measurements of PRDX protein levels were available).

OPN was again the key variable for correctly placing patients into the IgAN class. Mean eGFR was the most important variable for healthy participants, and WBC and Hb were important for predicting the LN class. However, PRDX3 was now as strong as WBC for the LN class, and it also appears to be relevant for the MN class (together with eGFR in the latter case). In contrast to the results presented in Table 3, two variables are now relevant for predicting the MN class. Interestingly, PRDX4 remained classified as "tentative" (uncertain) by Boruta, and it seems to contribute some information to LN class prediction. Furthermore, the classification error for MN prediction improved significantly, as shown in Table 8. On the other hand, the predictions for healthy participants are now wrong. Figure 3 shows those results in pictorial form.

Table 7. Importance of variables that were not rejected by the Boruta algorithm for a Random Forest classifier that predicts the glomerulopathy class of the patient.

As mentioned in Section 2.2, OPN was measured at two time points: shortly after diagnosis and after a mean follow-up of 27.8 months. Table 1 shows that all GN patients received an angiotensin-converting enzyme inhibitor (ACEi) or an angiotensin receptor blocker at both time points. Immunosuppressive treatment differed between groups: it was received by 71-75% of MN, 91-94% of LN, and 28-34% of IgAN patients. Figure 4 shows the OPN values at both time points. At follow-up, OPN is clearly no longer a differentiating factor.

The Spearman correlation analysis summarized in Table 9 shows some level of association between OPN and PLTs in patients with IgAN at follow-up. (The remaining correlations are available in Supplement, Table S1.)

To investigate the potential relationship between OPN and platelets [22, 23], we searched the UniProt database for all human proteins annotated with the term "platelets" (PLT), retrieving 1340 proteins. Using the STRING database, we selected proteins that interacted directly or indirectly with OPN, applying an increased data-reliability threshold (setting: "high confidence"). This yielded 68 directly interacting proteins and 500 proteins in the second interaction layer; 500 is the maximum number of second-layer proteins that the algorithm can return. Comparing the two lists gave 74 proteins that are annotated with the term "platelets" and interact with OPN directly or via at most one mediating protein (Supplement, File S2). Of those 74 proteins, eight interact directly with OPN (process shown in Figure 5). To interpret the results more broadly, we performed functional enrichment. We set the cut-off point at FDR < 0.0001, or at the limit of the top 100 results for a given category [24, 25]. We also prepared a functional analysis of selected proteins to evaluate their role in various biologic pathways, with eight proteins being functionally analyzed in gene ontology terms (an adjusted p value < 0.0001 was considered significant). The Kyoto Encyclopedia of Genes and Genomes was used to select the major biologic pathways for the target genes [26]. A pathway comprises a minimum of two genes. The p value obtained for each biologic pathway was adjusted using the Benjamini-Hochberg false discovery rate procedure [27]. Biologic pathways with a false discovery rate less than 0.0001 were considered significant. Table 10 summarizes the selected pathways and processes.
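The Benjamini-Hochberg adjustment mentioned above can be sketched in Python as follows. This is a generic textbook form of the step-up procedure, not the code used by the enrichment tools in the study, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order.

    Each p value is scaled by m / rank (rank 1 = smallest p), and the
    adjusted values are made monotone by taking a running minimum from
    the largest p value downward.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant(p_values, threshold=1e-4):
    """Indices whose adjusted p value falls below the chosen threshold."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(adjusted) if q < threshold]
```

With the threshold of 0.0001 described above, only pathways whose adjusted value stays below that limit after correction for the number of pathways tested would be retained.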

In our opinion, a major finding of this study is that the identification of biomarkers in nephrology might be empowered by ML. ML has recently become widely used in numerous biomedical applications ranging from the analysis of Parkinson's disease [28], through the prediction of COVID-19 patient health [29], to spectacularly accurate predictions of three-dimensional protein structure [30]. ML methods complement traditional statistical analysis for problems that involve complex relationships between various parameters of the studied phenomena and allow us to obtain predictive models in such situations. In particular, ML methods are widely used for identification of biomarkers [31] for diseases with complex and poorly understood mechanisms. The general idea is straightforward: if a robust predictive model can be obtained for the process under scrutiny, the variables that are used by the model are necessarily connected to this process, even if we currently do not understand why and how they are connected. Such variables can then be used as biomarkers. Moreover, they can also foster understanding by focusing experimental effort.

Random Forest [31] was used in the present work both as a classifier and as the engine for feature selection. In a thorough review of 179 classifiers from 17 families, performed on 121 data sets, classifiers from the Random Forest family showed the best and most robust performance [31, 32]. The all-relevant feature selection algorithm Boruta [31], a wrapper around Random Forest, was used for identification of relevant biomarkers. It has been tested on a wide range of problems, and several recently published real-world datasets showed that the algorithm is both sensitive and selective [33].

In nephrology, ML has been applied to the prediction of IgAN progression to end-stage kidney disease, identification of diabetic and nondiabetic renal disorders, assessment of acute kidney injury, and dialysis-associated death [34]. In the present study, we used ML algorithms to select relevant variables from a clinical dataset, and we then applied them to distinguish various GN classes. Even when applied to small groups, the algorithm correctly identified patients with IgAN, as confirmed by biopsy and a standard Mann-Whitney U-test. Interestingly, OPN was observed to be more specific than PRDX in the selected subgroup, though it is unlikely to be able to serve as a single biomarker, given that heightened levels are seen in various conditions.

The data on OPN levels in various GNs are conflicting. In children, higher OPN levels were found in patients with IgAN and focal segmental glomerulosclerosis than in those with IgAN and minimal change disease [35]. In adults, OPN levels were normal in patients with MN and minimal change disease and were even reduced in patients with IgAN [36]. Gang et al. [36] attributed their finding to the presence of thrombin-cleaved OPN fragments undetectable in their measurements. Building on that hypothesis, Kitagori et al. compared full and cleaved OPN (N fragments) in patients with LN, diabetic nephropathy, IgAN, and minimal change disease, finding no difference in full uOPN levels between the groups and increased levels of N fragments in patients with LN and diabetic nephropathy [37].

Our study focused on full-length uOPN but included samples from adult patients taken at two different time points. During their long follow-up, most patients received treatment with angiotensin-converting enzyme inhibitors and/or immunosuppressants, which might have influenced the results. Remission of disease, progressive fibrosis, or variance in the site of damage (glomerular vs. interstitial) could be factors responsible for discrepancies in OPN levels at baseline and follow-up [38]. However, given that control (protocol) biopsies were not performed, that hypothesis cannot be proved. Another explanation could be that uOPN is elevated only in active IgAN and normalizes after treatment to the same level as in other GNs such as MN or LN. If that is true, OPN could become a very simple biomarker for use, for example, in an outpatient clinic to confirm active IgAN requiring treatment. This finding must be validated.

OPN is involved in many metabolic pathways, as shown in Figure 6 and Table 10. Because of the correlation with platelets, we have narrowed our variables to those two, and we have selected several proteins that might be targets for further investigation. Some of the functional pathways (for example, those connected to infectious diseases, particularly leishmaniosis) were also noted in another study concerning risk loci for IgAN [39]. So far, no single biomarker is likely to be diagnostic of a single GN, but in our opinion, a panel might bring the sensitivity and specificity that have long been sought.

Our study has several limitations. First, each GN was represented by a small sample. As our patient database evolves, those numbers will increase. For now, this research can be considered a preliminary study aimed at creating more interest in the topic. Second, further studies should include more proteins/markers or even the whole serum/urinary proteome. Third, OPN levels are likely to be linked to a histologic process and not to a specific GN. Correlation between OPN and a proteomic tissue analysis would be a valuable contribution and should be included in prospective studies. However, the need for the protocol to obtain control biopsies in patients during remission might be problematic. We did not correlate the OPN levels with kidney biopsy results for several reasons: (1) each of the studied glomerulonephritides has its own, completely different classification; (2) from that point of view, the group was therefore highly heterogeneous; and (3) some biopsies were performed and evaluated in other centers by other pathologists, which could influence the results. We summarized the biopsy results in Supplement, Tables S2-S4. We have not correlated uOPN levels with serum OPN levels or other biochemical results because the significance of OPN in other diseases is yet to be determined, and such a correlation effort would have unnecessarily complicated the analysis. In many patients, we also did not have access to certain data concerning previous history, such as medications taken before study inclusion. Therefore, we decided against including these data in the study.

Given the growing burden of CKD, biomarker identification and validation have become an emerging issue. In our opinion, OPN should be included in further studies as a potential biomarker in nephrology. Based on our results, we are sure that ML should become a standard in biomarker research. As a supplement to ML, proteome databases can help place results in the context of numerous biologic pathways, pointing toward proteins that could be the next step in nephrology biomarker research.

Supplementary Materials: The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/biomedicines10040734/s1, File S1: machine learning scripts applied in the R software; File S2: list of selected proteins from the STRING database; Table S1: correlations of OPN with clinical parameters and long-term clinical follow-up; Tables S2-S4: histopathological classification of study participants.

Chronic Kidney Disease in the United States, 2021; US Department of Health and Human Services

The genetic architecture of membranous nephropathy and its potential to improve non-invasive diagnosis

Quantitative analyses of osteopontin mRNA expression in human proximal tubules isolated from renal biopsy tissue sections of minimal change nephrotic syndrome and IgA glomerulonephropathy patients

Peroxiredoxins as Markers of Oxidative Stress in IgA Nephropathy, Membranous Nephropathy and Lupus Nephritis

From Research Bench to Personalized Care

The multiple functions and mechanisms of osteopontin

Association between Osteopontin Promoter Gene Polymorphisms and Haplotypes with Risk of Diabetic Nephropathy

Osteopontin Gene Polymorphism and Urinary OPN Excretion in Patients with Immunoglobulin A Nephropathy

Prediction of acute renal allograft rejection by combined HLA-G 14-bp insertion/deletion genotype analysis and detection of kidney injury molecule-1 and osteopontin in the peripheral blood

Evaluation of urinary biomarkers for early detection of acute kidney injury in a rat nephropathy model

Significance of TRPV5 and OPN biomarker levels in clinical diagnosis of patients with early urinary calculi

Osteopontin and Disease Activity in Patients with Recent-onset Systemic Lupus Erythematosus: Results from the SLICC Inception Cohort

Biomedicines 2022

Osteopontin is a strong predictor of incipient diabetic nephropathy, cardiovascular disease, and all-cause mortality in patients with type 1 diabetes

The role of osteopontin as a candidate biomarker of renal involvement in systemic lupus erythematosus

The incidence of primary glomerulonephritis worldwide: A systematic review of the literature

Random Forests

Boruta-A System for Feature Selection

Classification and Regression by RandomForest

Feature Selection with the Boruta Package

R: A Language and Environment for Statistical Computing

The activation state of alphavbeta 3 regulates platelet and lymphocyte adhesion to intact and thrombin-cleaved osteopontin

The STRING database in 2021: Customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets

UniProt: The universal protein knowledgebase in 2021

Kyoto encyclopedia of genes and genomes

Encyclopedia

Predicting optimal deep brain stimulation parameters for Parkinson's disease using functional MRI and machine learning

COVID-19 Patient Health Prediction Using Boosted Random Forest Algorithm. Front. Public Health

Highly accurate protein structure prediction with AlphaFold

Applications of machine learning in drug discovery and development

Do we need hundreds of classifiers to solve real world classification problems?

Machine learning in nephrology: Scratching the surface

Urinary OPN excretion in children with glomerular proteinuria

Reduced urinary excretion of intact osteopontin in patients with IgA nephropathy

Cleaved Form of Osteopontin in Urine as a Clinical Marker of Lupus Nephritis

Osteopontin plays a critical role in interstitial fibrosis but not glomerular sclerosis in diabetic nephropathy

Discovery of new risk loci for IgA nephropathy implicates genes involved in immunity against intestinal pathogens

The routine laboratory tests were performed by the diagnostic laboratory at Infant Jesus Clinical Hospital, University Medical Center, Medical University of Warsaw, during each patient's routine visits to the Nephrology and Transplantation outpatient clinic.

The authors have no conflicts of interest to declare.