Semi-supervised Predictive Clustering Trees for Multi-label Protein Subcellular Localization

Alcantara, Leonardo U.; Triguero, Isaac; Cerri, Ricardo

doi:10.1007/978-3-031-79032-4_27

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 15413))

Included in the following conference series:

Brazilian Conference on Intelligent Systems

417 Accesses
1 Citation

Abstract

Protein subcellular localization is an important classification task because the location of proteins in a cell is directly linked to their functions. Since a protein can act at two or more locations simultaneously, multi-label classification algorithms are necessary. The currently used algorithms are usually based on supervised learning, which presents some disadvantages such as (i) a need for a large amount of labeled instances for training; (ii) a waste of valuable information that labeled instances can provide; and (iii) a high cost involved in obtaining labeled instances for training. To overcome these disadvantages, semi-supervised learning can be applied, where classifiers exploit both labeled and unlabeled data. Thus, in this paper, we propose a new semi-supervised algorithm for multi-label protein subcellular localization. Our proposal is based on decision tree classifiers induced using predictive clustering trees. We investigate many semi-supervised protein subcellular localization scenarios to test whether unlabeled instances can improve the multi-label classification process. Our results show that the proposal can achieve competitive or better results when compared to the pure supervised version of the predictive clustering trees.

Access provided by University of Notre Dame Hesburgh Library. Download conference paper PDF

Human Protein Subcellular Localization with Integrated Source and Multi-label Ensemble Classifier

Article Open access 21 June 2016

Multiple Protein Subcellular Locations Prediction Based on Deep Convolutional Neural Networks with Self-Attention Mechanism

Article 23 January 2022

Predicting the Subcellular Localization of Multi-site Protein Based on Fusion Feature and Multi-label Deep Forest Model

1 Introduction

Determining the protein subcellular localization (PSL) is a crucial process for deducing and understanding protein functions and for genome annotation [26]. PSL is especially important in the medical field because it is fundamental for discovering and analyzing new kinds of medicines [21]. It can also help to identify some diseases, such as cancer or Alzheimer, due to abnormal PSL being correlated to these diseases [9].

Although experimental approaches are the most reliable methods to determine PSL, they are very costly and labor-intensive. In recent years, with the completion of many large-scale genomics and proteomics sequencing projects, a vast number of protein sequences have been discovered, making experimental approaches even more unfeasible. Thus, Machine Learning (ML) computational predictors have been used, since they can handle this problem quickly and accurately [1, 27].

In addition, proteins can move between two or more locations within a cell or reside at more than one location simultaneously. Thus, multi-label classification methods have also been developed to deal with PSL problems [12, 29, 33, 36]. All these methods use training and test datasets normally acquired only from proteins whose localizations were experimentally defined, referred to as Labeled Proteins (LPs) [6].

Although using only LPs is a well-established method in the literature, it presents some disadvantages: (i) LPs are hard to obtain because they need to be annotated with experimental approaches, which, are labor-intensive and time-consuming; (ii) these experimental protein sequences are just a small part of the overall sequences in the knowledge bases [6, 35]. About only 5.5% of all instances in UniProtKB/Swiss-Prot are LPs; and (iii) normally not all LPs retrieved from the knowledge bases are used because some of them are homologous or similar proteins to each other. This may cause overfitting, by the use of models or procedures that include more terms than are necessary or use more complicated approaches than are necessary [14]. This problem is undesirable because it leads to models that adjust themselves really well to a particular set of data but fail to fit additional data or predict future observations reliably, therefore creating weaker classifiers [6, 35].

Usually, most works in the literature use supervised ML to deal with PSL. Therefore, these works disregard proteins whose localizations are missing, referred to as Not Labeled Proteins (NLPs). However, the small number of available labeled proteins can lead to datasets with an insufficient number of instances. As an example, this is the case of the virus benchmark dataset used in the works of Wan et al. [33] and Shen and Chou [23], which has only 207 proteins, only 8 of them located in the “viral capsid”.

Supervised methods ignore the fact that NLPs can provide valuable information for the classification. For example, they contain information about the joint probability distribution over sequence elements [7, 35]. In situations where there is a large amount of unlabeled data, and the economical commitment necessary to label them all is not affordable, Semi-supervised Learning (SSL) [30] provides strategies to overcome these limitations. SSL is a learning paradigm focused on learning from both labeled and unlabeled data, mainly centered on addressing problems where labeled data is hard to obtain, and unlabeled data is abundant.

The main advantage of using learning strategies that take advantage of unlabeled data for PSL prediction is that we can construct reliable classifiers to predict the PSL of organisms or groups of organisms with very few labeled instances. This is the case, for example, of domains such as Archaea and Bacteria, groups of organisms such as Alveolata and Amoebozoa and Viruses. As an example, Table 1 presents the number of LP and NLP proteins of some domains and groups of organisms in the UniprotKB/Swiss-Prot in the 2021_03 release^{Footnote 1}.

Table 1. Comparison between LPs and NLPs in UniprotKB/Swiss-Prot

Full size table

In this paper, we propose a semi-supervised multi-label method based on the concept of Predictive Clustering Trees (PCT), a decision tree induction algorithm well-established in the multi-label classification literature [32]. We investigate many semi-supervised protein subcellular localization scenarios in datasets related to fungus, viruses, and plants in order to test whether unlabeled instances can improve the multi-label prediction of protein subcellular localization. Our results show that the proposal can achieve competitive or better results when compared to the original pure supervised version of the predictive clustering trees algorithm.

The remainder of this paper is organized as follows. Section 2 presents the main theoretical concepts needed for the development of our proposal. Section 3 presents our proposed semi-supervised PCT. Our methodology is presented in Sect. 4, while our experiments are presented and discussed in Sect. 5. Finally, Sect. 6 presents our conclusions and future work.

2 Theoretical Foundation

This section presents the main concepts necessary to develop our proposal.

2.1 Multi-label Classification

Traditional classification algorithms associate a single label/class with each instance in a dataset. However, there are many problems where we need to associate an instance with two or more labels simultaneously. In multi-label classification, an instance can be classified into two or more labels simultaneously. Documents, images, videos, music, emotions, and proteins are examples of multi-label data Given $L = \{l_j: j = 1, ..., q\}$ a finite label set, multi-label algorithms associate to each instance $X_i$ of a dataset, a set of labels $Y_i$ such that $Y_i \subseteq L$ [31].

Multi-label classification methods can be characterized into three approaches [18]: i) Problem Transformation, which transforms multi-label problems into single-label problems and then employs traditional classification algorithms; ii) Algorithm Adaptation, which creates algorithms to handle multiple labels simultaneously; and iii) Ensemble, which combines a set of multi-label classifiers to solve the problem.

In this work, our proposed semi-supervised predictive clustering tree is based on the algorithm adaptation approach since the literature has already shown the advantages of algorithm adaptation over problem transformation, such as considering label correlations [4, 32].

2.2 Semi-supervised Classification

The main goal of Semi-Supervised Classification (SSC) is to construct a model that uses labeled and unlabeled data, obtaining a better prediction performance than the one trained only with labeled data. Thus, SSC learning can be seen as a midway between supervised learning, in which we only learn from labeled instances, and unsupervised learning, in which we only learn from unlabeled instances [22].

SSC can be helpful in situations where (i) the labels of the instances are hard to obtain, requiring human specialists, special devices, slow experiments, etc., and (ii) there is a large amount of unlabeled data available. In these cases, SSL requires less human effort and can potentially improve the accuracy of the classifiers. Many domains struggle with the abovementioned problems, such as speech recognition, bioinformatics, text processing, and image categorization [19, 22].

SSC can be divided into two categories: transductive learning and inductive learning. While inductive learning focuses on building a good classifier capable of predicting unseen data accurately, transductive learning aims to predict the labels from the unlabeled instances given in the training set [30].

In this work, our proposed semi-supervised predictive clustering tree is based on inductive learning.

2.3 Predictive Clustering Trees

Predictive clustering trees (PCTs) [3] are a generalization of decision trees that can be used for clustering and classification. In regular decision trees, the leaves contain the classes, and the branches from the root to the leaves contain conditions for the classification. PCTs view a decision tree as a hierarchy of clusters in which the root of a PCT represents a cluster containing all the data and is recursively partitioned into smaller clusters using a criterion to select the best split. The leaves of a PCT correspond to the lowest level of the cluster hierarchy, and each leaf is labeled with a cluster prototype (multi-label prediction) [16].

PCTs can be constructed using a conventional top-down induction of decision trees heuristic (TDIDT) [5]. The algorithm receives a set of examples (E) as input and outputs the final PCT. To do this, it uses a heuristic (h) to select the best test ($t^*$) between the tests (t), also called possible splits. The heuristic used in PCT is the variance reduction caused by partitioning the instances from a node into many data partitions (child nodes) according to the best test ($t^*$).

To find the best test at each node, the BestTest procedure searches for the best acceptable test by calculating the variance reduction for each possible test. The procedure returns the best test found, the heuristic value, and the new partitions of the node data that will be used as child nodes. This process is repeated for every child node until no acceptable test can be found, which means that no test significantly reduces the variance of the node. Then, the algorithm creates a leaf and computes the prototype of the instances belonging to that leaf to create the predicted labels. This whole PCT induction process is shown in Algorithm 1.

Maximizing the variance reduction when splitting each node ensures that the corresponding clusters’ homogeneity increases with the depth of the tree, improving the prediction performance in general [3]. The main difference between PCTs and regular decision trees is that PCTs treat the variance and prototype functions as hyperparameters that are instantiated based on the learning task. Thus, PCTs can be used in many tasks, such as clustering [3, 25], multi-output classification and regression [2, 24], time series data analysis [10], interaction prediction [20], and hierarchical multi-label classification [32]. In our experiments, we used the Clus implementation of the PCTs [16, 32].

3 Semi-supervised Predictive Clustering Trees

Our proposal consists of adapting the Clus induction procedure to create a Semi-Supervised Predictive Clustering Tree (SSL-PCT) able to induce a clustering decision tree using labeled and unlabeled data simultaneously. Our method preserves the appealing characteristics of using a decision tree, such as being fast to learn and being readily interpretable (unlike the models learned by most semi-supervised algorithms), while making use of unlabeled data.

Some essential modifications are needed to adapt the Clus algorithm to make it semi-supervised. The first one is related to the initial set of instances (E) since now we have to modify Clus to accept labeled and unlabeled instances as input. Our dataset is now represented as $E = E_l \cup E_u $, where $E_l$ is the labeled part of the training set and $E_u$ is the unlabeled part. The second modification is related to the impurity calculation in the BestTest procedure. We must create a new heuristic that considers labeled and unlabeled data in a node. Our proposed impurity comprises two parts, a supervised and an unsupervised. A $\omega $ hyperparameter controls the influence of each part. The proposed impurity is presented in Eq. 1.

$$\begin{aligned} \text {Impurity}_{\text {SSL}}({E}) = \underbrace{ \frac{\omega }{T} . \sum _{i=1}^{ T} \text {Impurity}(E_l, Y_i)}_{\text {Supervised part}} + \underbrace{ \frac{1-\omega }{D} . \sum _{i=1}^{ D} \text { Impurity(}E, X_i)}_{\text {Unsupervised part}} \end{aligned}$$

(1)

In Eq. 1, E is the set of instances, $Y_i$ represents each of the T target features (labels), $X_i$ represents each of the D descriptive features, and $\omega $ is a value between 0 and 1 that weights the supervision and unsupervision used in the method. If $\omega = 1$, the method works using only the labeled part of the dataset, behaving as a supervised learning algorithm. On the other hand, if $\omega = 0$, the algorithm behaves as an unsupervised learning algorithm, not using the labels to define the splits. Any value between 0 and 1 means the algorithm uses labeled and unlabeled data. Recall that in Eq. 1, the whole dataset E is considered an unsupervised dataset since, regardless of the instances being labeled or not, the unsupervised part of the equation uses only descriptive features to search for the best split. What changes is the weight of the supervision employed during induction.

One of our method’s main advantages is the controlled level of supervision. When performing semi-supervised learning, the unlabeled data can always negatively affect prediction performance. By adapting the level of supervision, we can guarantee that our SSL-PCT achieves better or equal prediction performance compared to its supervised counterpart.

The main impurity measure used in the original PCT algorithm is the Gini Impurity (GI). GI is a measure used in decision trees to calculate how much a set of instances in a node is pure to find the best split. GI can be calculated for a set of data E and a specific label Y according to Eq. 2.

$$\begin{aligned} \text {Gini(}E, Y) = 1 - \sum _{i=1}^{ C} p_{i}^{2} \end{aligned}$$

(2)

In Eq. 2, C is the number of possible values for the label Y. As an example, in a binary classification, $C = 2$. Also, $p_i$ is the a priori probability of class $c_i$.

Since we now have labeled and unlabeled instances, we need to define the impurity calculation in each case. The impurity over the target features (labels) is calculated using only the labeled instances ($E_l$). It is referred to as a labeled impurity. In contrast, the impurity over the descriptive features is calculated using all the labeled and unlabeled instances (E) and is referred to as an unlabeled impurity. Equation 3 presents the labeled impurity calculation for a node of our SSL-PCT. In the equation, $Y_i$ is a label, $E_{l}^{train}$ represents the whole labeled set of instances before the root node split, and $E_l$ is a subset of instances resulting after the root node split.

$$\begin{aligned} \text {Impurity(}E_l, Y_i\text {)} = \frac{\text {Gini}(E_l, Y_i)}{\text {Gini}(E_l^{train}, Y_i)} \end{aligned}$$

(3)

The unlabeled impurity is calculated over the descriptive features, which, differently from the labels, can assume numeric or categorical values. Thus, we need to define an impurity for each case. The unlabeled impurities for a dataset E over categorical and numeric descriptive feature $X_i$ are calculated according to Eqs. 4 and 5.

$$\begin{aligned} \underbrace{\text {Impurity(}E, X_i\text {)} = \frac{\text {Gini}(E, X_i)}{\text {Gini}(E^{train}, X_i)}}_{\text {For categorical features}} \end{aligned}$$

(4)

$$\begin{aligned} \underbrace{\text {Impurity(}E, X_i\text {)} = \frac{\text {Var}(E, X_i)}{\text {Var}(E^{train}, X_i)}}_{\text {For numeric features}} \end{aligned}$$

(5)

The variance of the $i^{th}$ feature in the set E over the descriptive feature $X_i$ is calculated according to Eq. 6.

$$\begin{aligned} \text {Var}(E, X_i) = \frac{\sum _{j=1}^{ N}(X_{i}^{j})^{2} - \frac{1}{N} . (\sum _{j=1}^{ N} X_{i}^{j})^{2}}{N} \end{aligned}$$

(6)

After modifying the impurity calculations, our proposed SSL-PCT can now handle datasets containing labeled and unlabeled instances, exploiting them according to the $\omega $ weight hyperparameter.

4 Methodology

This section presents experiments, software, datasets, and evaluation measures.

Table 2. Characteristics of the datasets used in the experiments

Full size table

4.1 Clus

Clus is an open-source decision tree and rules learning system that implements the predictive clustering framework [3]. It generalizes the decision tree concept by learning trees interpreted as cluster hierarchies, called PCTs. To create a cluster during induction, Clus can be optimized using different heuristics to choose the best split in each internal node, improving prediction performance.

In this work, we used the implementation provided by Vens et al. [32], modifying it to use our proposed semi-supervised impurity function. All implementations and experiments were performed using the Java program language.

4.2 Datasets

UniProt Knowledge Base (UniProtKB) [28] is an expertly curated database, acting as a central access point for integrated protein information with cross-references to multiple sources. The knowledge base consists of UniProtKB/SwissProt and UniProtKB/TrEMBL. The former contains manual and high-quality annotations with information extracted from literature and curator-evaluated computational analysis. The latter contains computationally analyzed records enriched with automatic annotation and classification.

We extracted three UniProtKB datasets: viruses, plants, and fungi. We used data extracted from the UniProtKB/Swiss-Prot 2021_03 version. We used Gene Ontology (GO) [13] terms as features for each protein. We extracted GO terms associated with each protein from the UniProt-GOA [15] database.

The GO is a set of defined vocabularies and terms representing gene functions and products across different species. Two strategies were used to construct feature vectors: 1 and 0 values to indicate the presence or absence of GO terms and the frequency (TF) of each GO term for each protein. Therefore, we constructed six protein datasets, three with categorical descriptive features (1/0) and three with numeric descriptive features (TF).

Table 2 describes the characteristics of the datasets used in our experiments. Card refers to cardinality, the average number of labels per instance in a multi-label dataset. Dens refers to density, which is a dimensionless level of the cardinality. Equations 7 and 8 show how these two measures are calculated. In the equations, $Y_i$ represents a label of the dataset D.

In Table 2, MeanIR refers to the mean imbalance ratio of the datasets, which is a value that represents the average level of imbalance in a multi-label dataset [8]. This measure is essential because most of the multi-label datasets suffer from a high level of imbalance, making the learning task more complicated.

4.3 Evaluation Measures

To evaluate the performance of our proposed SSL-PCT, we used the multi-label evaluation measures defined in the work of Godbole and Sarawagi [11]. They are presented in Eqs. 9, 10, 11 and 12.

Consider that each multi-label instance is represented by $(\textbf{x}_i, Y_i)$ in which $i = 1 \dots m$ (with m the total number of instances). In the equations, $Y_i \subseteq L$ is the set of true labels of instance $\textbf{x}_i$, and $L = \{\lambda _j : j = 1 \dots q\}$ is the set of all labels in the problem. Given a specific instance $\textbf{x}_i$, the set of labels predicted by a classifier is represented by $Z_i$. The F1 measure in Eq. 12 is the harmonic mean of precision and recall.

5 Experiments and Discussion

We executed our experiments using a 5-fold cross-validation procedure, splitting the data into five partitions containing 20% of the original data. For each fold, the algorithm is trained using the remaining four folds, referred to as a training set, and then used to label the instances in the fold, referred to as the test set. The test set is kept aside from the training data to evaluate the inductive performance of our method. The transductive performance is also evaluated, labeling the instances of the four folds used as the training set. The final result is the average of the evaluation measures calculated for each fold.

Each training set is then divided into labeled and unlabeled portions of instances. Following the recommendation of Wang et al. [34], we do not keep class proportion in both labeled and unlabeled sets since the objective of semi-supervised learning is to exploit the use of unlabeled data to improve prediction performance. The selection of instances to be labeled is performed randomly until a specific criterion is met, and the remaining instances are used as unlabeled. In this case, the labels of the instances are removed and replaced with the symbol “?” so that our algorithm recognizes these instances as unlabeled. We ensure that every labeled portion of the training set has at least one representative instance for each class.

We used different ratios when dividing the training set to test many semi-supervised learning scenarios for each dataset. We performed as follows. Consider C as the number of classes in a dataset E. For every dataset, the labeled portion of the training set was constructed with $C \times \alpha $ instances. The $\alpha $ hyperparameter is a value that controls the number of instances in the labeled portion. We used three values for $\alpha $: 5, 10 and 20. This strategy was performed for every dataset in Table 2.

Considering the $\omega $ hyperparameter, which defines the contribution of supervised and unsupervised parts in the calculation of the impurity value (Eq. 1), we tested six different values: 0.0, 0.2, 0.4, 0.6, 0.8 and 1.0. Thus, we can evaluate if the unlabeled instances improved the prediction performance.

It is worth noting that our proposed SSL-PCT can present different performance on different datasets. This happens because semi-supervised methods are known to be domain dependent [17].

Table 3 shows the Accuracy and F1 values for all the experiments. Acc I and F1 I stand for Accuracy and F1 Inductive performances, while Acc T and F1 T stand for Accuracy and F1 Transductive performances. The best results for each dataset are highlighted in boldface to see the best performance among the variations of the $\omega $ hyperparameter. Each value is the Accuracy and F1 average values across 5-fold cross-validation. The rows containing the symbol “-” represent cases in which our heuristic could not find any acceptable test to put in the nodes of the decision trees. Thus, no tree was created in these cases. This only happened for some experiments with the numeric versions of the datasets when the $\omega $ hyperparameter was set to 0.0, i.e., when only unlabeled instances were used to calculate the impurity.

Table 3. Results in the Virus, Plant and Fungus Datasets.

Full size table

Table 4. Results in the Virus, Plant and Fungus Datasets without located_in Terms

Full size table

As can be seen from our results, in most cases, the best results were obtained when the maximum supervision level was used ($\omega = 1$), indicating that the algorithm could not improve the classification with the unlabeled data. It is possible to see that for numeric datasets, semi-supervision slightly improved the result for some cases compared to the purely supervised version. Even in these cases, the increase was not very significant. In some configurations where $\omega = 0.8$, a semi-supervised strategy could increase the performance by more than 1%.

An unusual point of these results is that the evaluation metrics returned very high values even with very few labeled instances. This is even more unusual when we see that in the experiments with $\alpha = 5$, we have approximately 30 Virus labeled instances (4.65% of the total), 60 Plant labeled instances (1.14% of the total), and 50 Fungus labeled instances (0.91% of the total).

A hypothesis raised to explain this point is that we have several types of GO terms, each one representing a concept^{Footnote 2}. As an example, we have the type involved_in, which tells if a protein is involved in some biological process, or the type enables, which tells whether a protein allows for some biological process to occur. A possible problem may be related to our features since in the cellular components’ ontology, we have annotations with the term located_in. This term appears in a protein when we have evidence that it is present in any cellular component, i.e., in this case, we have a descriptive feature representing almost the same concept of the target feature: subcellular localization. This can explain why the evaluation measures achieved high results, even with very few proteins, given that these located_in features will always be great candidates to be the best features to partition the dataset. New experiments were executed to confirm this hypothesis, removing all GO terms of the type located_in. Table 4 shows the results of these new experiments. Again, the best results are highlighted in boldface: Acc I and F1 I stand for Accuracy and F1 Inductive performances, and Acc T and F1 T stand for Accuracy and F1 Transductive performances.

From these new results, we can see that our hypothesis seems to be correct since the decision trees trained in the datasets without the located_in terms obtained much lower values for the evaluation measures. However, these values better represent the reality of the taxonomies with few proteins with annotated subcellular localizations.

6 Conclusions and Future Work

In this paper, we proposed a Semi-Supervised Predictive Clustering Tree (SSL-PCT) algorithm capable of classifying multi-label instances exploiting labeled and unlabeled data. Our proposal modifies the Clus algorithm [32], introducing a new impurity measure for the decision tree heuristic. Our new impurity uses labeled and unlabeled instances to find the best split in a decision tree node.

Our proposed algorithm is also flexible, as we have an $\omega $ hyperparameter that controls the tree’s supervision level. If $\omega = 1$, the tree acts precisely like a supervised PCT. If $\omega = 0$, the algorithm only uses the descriptive features from the labeled and unlabeled instances to construct the tree, without the information about the labels.

We applied our method to different protein subcellular localization datasets. We also compared many values of the $\omega $ hyperparameter, varying its value from 0.0 to 1.0 in steps of 0.2. We also investigated the performance of SSL-PCT when used with different amounts of labeled instances. The results show that our new method can exploit unlabeled data to help the classification in many scenarios.

Our experiments showed that the best results typically relied only on the labeled data. However, the experiments also showed that the unsupervised part of the algorithm can help improve the classification results.

In future work, we plan to expand the experiments in various ways, such as investigating more well-known multi-label datasets, better investigating the variation in the $\omega $ hyperparameter to analyze the influence of the level of supervision in the datasets, evaluating the prediction performance with more multi-label evaluation measures, and perform experiments with more variations in the proportions of labeled and unlabeled data in our datasets.

We will also study the possibility of creating semi-supervised learning random forests within our SSL-PCT. Random Forests are one of the best classification algorithms, and their ability to generate a feature ranking can help define the best features in multi-label datasets. We also plan to apply some pre-processing to the protein datasets to improve the results using unlabeled data. Since, in general, only a few features are used to construct the trees, there is an indication that we may remove some less important features in the used protein datasets.

Notes

References

Almagro Armenteros, J.J., Sønderby, C.K., Sønderby, S.K., Nielsen, H., Winther, O.: DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics 33(21), 3387–3395 (2017)
Article MATH Google Scholar
Basgalupp, M., Cerri, R., Schietgat, L., Triguero, I., Vens, C.: Beyond global and local multi-target learning. Inf. Sci. 579, 508–524 (2021)
Article MathSciNet MATH Google Scholar
Blockeel, H., Raedt, L.D., Ramon, J.: Top-down induction of clustering trees. In: Proceedings of the Fifteenth International Conference on Machine Learning, ICML 1998, pp. 55–63. Morgan Kaufmann Publishers Inc. (1998)
Google Scholar
Bogatinovski, J., Todorovski, L., Džeroski, S., Kocev, D.: Comprehensive comparative study of multi-label classification methods. Expert Syst. Appl. 203, 117215 (2022)
Article MATH Google Scholar
Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth and Brooks, Monterey (1984)
MATH Google Scholar
Cao, J., Liu, W., He, J., Gu, H.: Mining proteins with non-experimental annotations based on an active sample selection strategy for predicting protein subcellular localization. PLOS One 8, e67343 (2013)
Google Scholar
Caragea, C., Caragea, D., Silvescu, A., Honavar, V.: Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models. BMC Bioinform. 11(Suppl 8), S6 (2010)
Article MATH Google Scholar
Charte, F., Rivera, A.J., Del Jesus, M.J., Herrera, F.: Addressing imbalance in multilabel classification: measures and random resampling algorithms. Neurocomputing 163, 3–16 (2015)
Article Google Scholar
Cui, Q., JiangEmail, T., Liu, B., Ma, S.: Esub8: a novel tool to predict protein subcellular localizations in eukaryotic organisms. BMC Bioinform. 5, 66 (2004)
Article Google Scholar
Džeroski, S., Gjorgjioski, V., Slavkov, I., Struyf, J.: Analysis of time series data with predictive clustering trees. In: Džeroski, S., Struyf, J. (eds.) KDID 2006. LNCS, vol. 4747, pp. 63–80. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-75549-4_5
Chapter MATH Google Scholar
Godbole, S., Sarawagi, S.: Discriminative methods for multi-labeled classification. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 22–30. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24775-3_5
Chapter MATH Google Scholar
Guo, X., Liu, F., Ju, Y., Wang, Z., Wang, C.: Human protein subcellular localization with integrated source and multi-label ensemble classifier. Sci. Rep. 6, 28087 (2016)
Google Scholar
Harris, M.A., Clark, J., Ireland, A.: The gene ontology (go) database and informatics resource. Nucl. Acids Res 32, D258-61 (2004)
Google Scholar
Hawkins, D.M.: The Problem of Overfitting. J. Chem. Inf. Comput. Sci. 44, 1–12 (2004)
Article MATH Google Scholar
Huntley, R.P., et al.: The GOA database: gene ontology annotation updates for 2015. Nucl. Acids Res. 43, D1057–D1063 (2015)
Article MATH Google Scholar
Kocev, D., Slavkov, I., Dzeroski, S.: Feature ranking for multi-label classification using predictive clustering trees. In: Proceedings of Companion Publication of the European Conference on Machine Learning and Knowledge Discovery in Databases (2013)
Google Scholar
Levatić, J., Ceci, M., Kocev, D., DźEroski, S.: Semi-supervised classification trees. J. Intell. Inf. Syst. 49(3), 461–486 (2017)
Article MATH Google Scholar
Madjarov, G., Kocev, D., Gjorgjevikj, D., Džeroski, S.: An extensive experimental comparison of methods for multi-label learning. Pattern Recogn. 45(9), 3084–3104 (2012)
Article MATH Google Scholar
Pise, N.N., Kulkarni, P.: A survey of semi-supervised learning methods. In: 2008 International Conference on Computational Intelligence and Security, vol. 2, pp. 30–34 (2008)
Google Scholar
Pliakos, K., Vens, C.: Drug-target interaction prediction with tree-ensemble learning and output space reconstruction. BMC Bioinform. 21 (2020)
Google Scholar
Rey, S., Gardy, J.L., Brinkman, F.S.: Assessing the precision of high-throughput computational and laboratory approaches for the genome-wide identification of protein subcellular localization in bacteria. BMC Genom. 6, 162 (2005)
Article MATH Google Scholar
Sadarangani, A., Jivani, A.: A survey of semi-supervised learning. Int. J. Eng. Sci. Res. Technol. 5(10), 138–143 (2016)
MATH Google Scholar
Shen, H., Chou, K.: Virus-mPLoc: a fusion classifier for viral protein subcellular location prediction by incorporating multiple sites. J. Biomol. Struct. Dyn. 28(2), 175–186 (2010)
Article MATH Google Scholar
Struyf, J., Džeroski, S.: Constraint based induction of multi-objective regression trees. In: Bonchi, F., Boulicaut, J.-F. (eds.) KDID 2005. LNCS, vol. 3933, pp. 222–233. Springer, Heidelberg (2006). https://doi.org/10.1007/11733492_13
Chapter MATH Google Scholar
Struyf, J., Džeroski, S.: Clustering trees with instance level constraints. In: Kok, J.N., Koronacki, J., Mantaras, R.L., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 359–370. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74958-5_34
Chapter MATH Google Scholar
Su, E., Chiu, H., Lo, A., Hwang, J., Sung, T., Hsu, W.: Protein subcellular localization prediction based on compartment-specific features and structure conservation. BMC Bioinform. 8, 330 (2007)
Article MATH Google Scholar
Su, R., He, L., Liu, T., Liu, X., Wei, L.: Protein subcellular localization based on deep image features and criterion learning strategy. Brief. Bioinform. 22(4), bbaa313 (2020)
Google Scholar
The UniProt Consortium: UniProt: the universal protein knowledgebase. Nucl. Acids Res. 45(Issue D1), D158–D169 (2017)
Google Scholar
Thumuluri, V., Almagro Armenteros, J.J., Johansen, A., Nielsen, H., Winther, O.: DeepLoc 2.0: multi-label subcellular localization prediction using protein language models. Nucl. Acids Res. 50(W1), W228–W234 (2022)
Google Scholar
Triguero, I., Garcia, S., Herrera, F.: Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowl. Inf. Syst. 42(2), 245–284 (2015)
Article MATH Google Scholar
Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 667–685. Springer, Boston (2009). https://doi.org/10.1007/978-0-387-09823-4_34
Chapter Google Scholar
Vens, C., Struyf, J., Schietgat, L., Džeroski, S., Blockeel, H.: Decision trees for hierarchical multi-label classification. Mach. Learn. 73, 185–214 (2008)
Article MATH Google Scholar
Wan, S., Mak, M.W., Kung, S.Y.: mGOASVM: multi-label protein subcellular localization based on gene ontology and support vector machines. BMC Bioinform. 13, 290 (2012)
Article MATH Google Scholar
Wang, Y., Xu, X., Zhao, H., Hua, Z.: Semi-supervised learning based on nearest neighbor rule and cut edges. Knowl.-Based Syst. 23(6), 547–554 (2010)
Article MATH Google Scholar
Xu, Q., Hu, D., Xue, H., Yu, W., Yang, Q.: Semi-supervised protein subcellular localization. BMC Bioinform. 10(Suppl 1), S47 (2009)
Article MATH Google Scholar
Zhang, Q., et al.: Accurate prediction of multi-label protein subcellular localization through multi-view feature learning with RBRL classifier. Brief. Bioinform. 22(5) (2021)
Google Scholar

Download references

Acknowledgment

This study was financed by the São Paulo Research Foundation (FAPESP) grants #2016/25220-1, #2017/24807-1 and #2022/02981-8. I. Triguero is funded by a Maria Zambrano Senior Fellowship at the University of Granada. I. Triguero’s work is also supported by PID2023-149128NB-I00.

Author information

Authors and Affiliations

Department of Computing, Federal University of São Carlos, São Carlos, SP, Brazil
Leonardo U. Alcantara
Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
Isaac Triguero
Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, SP, Brazil
Ricardo Cerri

Authors

Leonardo U. Alcantara
View author publications
Search author on:PubMed Google Scholar
Isaac Triguero
View author publications
Search author on:PubMed Google Scholar
Ricardo Cerri
View author publications
Search author on:PubMed Google Scholar

Corresponding author

Correspondence to Ricardo Cerri .

Editor information

Editors and Affiliations

Universidade Federal Fluminense, Niterói, Brazil
Aline Paes
Instituto Tecnológico de Aeronáutica, São José dos Campos, Brazil
Filipe A. N. Verri

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Alcantara, L.U., Triguero, I., Cerri, R. (2025). Semi-supervised Predictive Clustering Trees for Multi-label Protein Subcellular Localization. In: Paes, A., Verri, F.A.N. (eds) Intelligent Systems. BRACIS 2024. Lecture Notes in Computer Science(), vol 15413. Springer, Cham. https://doi.org/10.1007/978-3-031-79032-4_27

Download citation

DOI: https://doi.org/10.1007/978-3-031-79032-4_27
Published: 30 January 2025
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-79031-7
Online ISBN: 978-3-031-79032-4
eBook Packages: Computer ScienceComputer Science (R0)

Semi-supervised Predictive Clustering Trees for Multi-label Protein Subcellular Localization

Abstract

Similar content being viewed by others

Human Protein Subcellular Localization with Integrated Source and Multi-label Ensemble Classifier

Multiple Protein Subcellular Locations Prediction Based on Deep Convolutional Neural Networks with Self-Attention Mechanism

Predicting the Subcellular Localization of Multi-site Protein Based on Fusion Feature and Multi-label Deep Forest Model

1 Introduction