1 Introduction

The area of Automated Machine Learning (AutoML) emerged to automate the tedious process of manually testing different sets of algorithms and hyperparameters and performing the other data engineering tasks involved in solving a machine learning (ML) problem. AutoML methods reduce or eliminate the need for specialized human intervention in building and tuning models. Given its importance, all major tech companies, such as Google, Microsoft, and Amazon, nowadays offer AutoML techniques on their platforms.

An ML pipeline is defined as a set of ML operations that include data preprocessing, data classification and post-processing steps, which can be executed sequentially or in parallel. Most AutoML tasks are tackled as an optimization problem, with a search space that includes a range of algorithms and hyperparameters used to build machine learning task pipelines. The search method optimizes a metric of learning quality, such as the accuracy or f-measure [11].

Different optimization methods have been used in AutoML, including Bayesian optimization, evolutionary algorithms [5] and hybrid techniques, but few works have looked at how difficult this optimization problem is. One way to measure problem difficulty is to look at the fitness landscape of a problem [12]. The fitness landscape is defined by the set of viable solutions within the search space associated with the quality metric being optimized. Investigating the fitness landscape is important because it contributes to understanding the space from an optimization perspective, such as whether it has multiple global/local optima, saddle points, plateaus, or regions of low variability. Understanding these characteristics of the space leads to the construction of more effective AutoML methods, allowing for a more informed exploration of the space.

The analysis of fitness landscapes requires calculating distances between solutions to determine neighbourhoods of solutions with, for example, the same quality (plateaus) or peaks that may contain global or local optima, in the case of maximization problems. Calculating these distances will depend on how solutions are represented.

The current canonical representation of an AutoML pipeline is a tree. This representation is used because it suits the complex and hierarchical search spaces of AutoML [9]. However, this model presents two important limitations: (i) high computational complexity to calculate distance between solutions and (ii) difficulty in considering the semantic aspects of the solutions. With these drawbacks in mind, this work proposes to represent these pipelines as embeddings.

An embedding is essentially a numerical vector, generated, in our case, by training a neural network with data from a range of ML pipelines. The network captures latent features of the original space and maps them to its weights, which can later be used to represent these pipelines. Calculating the distance between embeddings is simple and has low computational complexity. In addition, embeddings can preserve the semantic aspects of the data they represent, which can be observed when they are plotted in lower-dimensional spaces.

Being able to plot ML pipelines in 2D spaces also allows for the visualization of both the path taken during optimization and the distance between different model configurations, facilitating the analysis of these aspects and, consequently, promoting a more efficient exploration of the search space. Finally, many state-of-the-art works in machine learning have been using embedding representations of spaces, suggesting them as a promising representation solution [15, 6].

Inspired by the work of [3] – which developed an embedding representation for symbolic regression trees and showed it to be effective for measuring the distance between trees and promising for capturing semantics – this paper proposes generating embeddings to represent AutoML pipelines. The generated embeddings are evaluated by comparing distance metrics across the different representations. Moreover, visual and qualitative analyses of the search space and of pipeline distances are performed to better assess the proposed representation.

The main contributions of this study are:

  • Comparison of two models to represent AutoML pipelines: trees and embeddings;

  • Analysis and evaluation of the use of this linear representation in the context of AutoML;

  • Investigation of semantic preservation in the representation through embeddings;

  • Development of a method for visualizing the search spaces.

In the long term, this research may contribute to the development of more robust, effective and efficient AutoML models, starting from the analysis of fitness landscapes.

This work is organized as follows. Section 2 reviews relevant literature, positioning the research within the state of the art. Section 3 outlines the methodology, detailing the techniques, metrics, and tools used to achieve the objectives. Section 4 discusses the results, analyzing the effectiveness of embeddings as representations of AutoML pipelines and visualizations of their search spaces. Finally, Sect. 5 presents the conclusions from the experiments and future work directions.

2 Related Work

A few studies have proposed new representations for AutoML pipelines and different visualization tools to better understand how small changes affect the fitness landscape.

Concerning pipeline representation, the research in [10] employs deep learning to build embeddings for optimizing AutoML pipelines. The method creates latent representations of pipeline configurations through DeepPipe, which uses Bayesian optimization to find ideal configurations. The system improves its generalization and prediction capabilities through meta-learning. Experiments on three meta-datasets show state-of-the-art results, surpassing existing methods such as OBOE, SMAC and AutoPrognosis.

With a different objective, the approach in [4] integrates NLP techniques with AutoML methods. The authors use embeddings to represent metadata from datasets and algorithm documentation, allowing AutoML systems to recommend machine learning pipelines based on textual descriptions. Their system increases pipeline optimization efficiency and shows significant improvements over frameworks like OBOE, AutoSklearn, AlphaD3M and TPOT, recommending solutions in under a second without prior runs. They also made their data, models and code publicly available.

Turning to visualization techniques, study [8] presents PipelineProfiler, an interactive visualization tool for exploring and comparing machine learning pipelines generated by different AutoML systems. Integrated into Jupyter Notebook, it facilitates pipeline analysis, including hyperparameters and evaluation metrics, and helps identify patterns that lead to better results.

In [16], AutoAIViz is presented as an interactive visualization tool that uses Conditional Parallel Coordinates (CPC) to explore model and pipeline generation steps. CPC visualization helps users understand how AutoML decisions impact fitness. A usability study showed the tool’s effectiveness in increasing the comprehensibility of AutoML processes, highlighting the importance of transparency in AutoML systems.

The authors in [14] introduce Atmseer, a visualization tool that enhances transparency and controllability in AutoML processes. It allows users to refine the search space and analyze results interactively. Atmseer offers multi-level visualizations, enabling real-time monitoring and modifications.

The methodology proposed in this paper can be coupled to any of the visualization tools aforementioned to represent ML pipelines in a 2D space. It can also help improve search by using informed decisions based on the shape of the search space.

3 Methodology

The main objective of this paper is, given a set of machine learning pipelines represented as trees, to map these trees to a new representation space where the pipelines are represented as embeddings. The embeddings are generated by training a neural network and then extracting the weights of the final encoder layer.

The proposed methodology can be divided into three steps: (i) generating a training set, which involves preparing machine learning pipelines to be given as input to the neural network; (ii) configuring and training the neural network; (iii) given a new tree, extracting its embedding from the network (Fig. 1).

Fig. 1.
figure 1

Methodology steps to generate the pipeline represented as an embedding.

3.1 Generating the Training Set

As previously explained, an ML pipeline includes preprocessing steps (data cleaning, feature selection), a machine learning model (classification or regression) and post-processing steps. The search space used in this paper was borrowed from [9], where pipelines were generated by following a Context-Free Grammar with 38 production rules, 92 terminals, and 45 non-terminals. These include three preprocessing algorithms (PCA, Select K-Best, Standard Scaler) and five classification methods (Logistic Regression, MLP, KNN, Random Forest, AdaBoost), with parameter counts ranging from two to seven. Default values from Scikit-learn were used for some continuous parameters, while relative values were employed for parameters that depend on the number of attributes. This search space was chosen because it contains 69,960 solutions, allowing all of them to be enumerated. This facilitates testing a new representation methodology, as we know which solutions should be more similar to each other.

These generated pipelines are initially structured as trees. They must be converted into a linear, non-hierarchical format of fixed size to be used as input to the neural network. A Depth-First Search (DFS) is performed on the tree structure to visit the connected nodes and leaves [3], and a grammar-based parser generates a token sequence representing the original pipeline in the appropriate format.
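To make the conversion concrete, the sketch below linearizes a toy pipeline tree by pre-order DFS. The tree encoding and token names are hypothetical illustrations, not the exact grammar of [9]:

```python
# Sketch of pipeline-tree linearization via depth-first search (DFS).
# Each node is a (name, children) pair; names here are illustrative tokens.
def linearize(node):
    """Flatten a pipeline tree into a token sequence by pre-order DFS."""
    name, children = node
    tokens = [name]
    for child in children:
        tokens.extend(linearize(child))
    return tokens

# Hypothetical pipeline: PCA preprocessing followed by a Random Forest classifier.
pipeline = ("<pipeline>", [
    ("PCA", [("n_components=8", []), ("whiten=False", [])]),
    ("RandomForest", [("criterion=entropy", []), ("max_features=sqrt", [])]),
])

print(linearize(pipeline))
# ['<pipeline>', 'PCA', 'n_components=8', 'whiten=False',
#  'RandomForest', 'criterion=entropy', 'max_features=sqrt']
```

The resulting token sequence has a linear, non-hierarchical shape suitable for padding to a fixed length.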

3.2 Neural Network Configuration

The neural network's purpose is to learn a function that encodes the linearized pipeline string into an embedding by mapping each token of the string into an embedding space.

The trained neural network is a Transformer [13], an architecture developed for both supervised and unsupervised machine learning tasks, primarily applied to natural language processing (NLP) to capture semantics in large datasets. Transformers revolutionized NLP by implementing Self-Attention mechanisms, allowing the encoder to extract information from the entire input sequence and enabling the decoder to give more importance to specific elements of the input, depending on the token being processed at the moment. With multiple Self-Attention heads, it is possible to map various pieces of information to a single word, allowing each position in the input sequence to attribute importance to all others, improving the understanding of context and semantic relationships. Table 1 presents the Transformer parameters and tested values.

Two tree representation strategies will be examined: a detailed one with tags indicating algorithm types (i.e., PCA points to <features_dim>, <whiten> and <svd_solver>, and these nodes point to the values of each parameter) and a straightforward one with only algorithm names and parameters (i.e., PCA points directly to the values of each parameter). For the first, the grammar vocabulary comprises a total of 137 tokens; for the second, 69 tokens. In both, three of the tokens are reserved for specific scenarios: <pad_token> (used for padding sequences to the same length), <unk_token> (represents unknown words or parameters) and <sos_token> (marks the start of a sequence).
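A minimal vocabulary and encoding sketch is shown below; only the three special token names come from the text, while the mapping scheme and example tokens are illustrative assumptions:

```python
# Minimal vocabulary builder with the three special tokens described above.
SPECIALS = ["<pad_token>", "<unk_token>", "<sos_token>"]

def build_vocab(sequences):
    """Assign an integer id to each token, reserving the first ids for specials."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(seq, vocab, max_len):
    """Prepend <sos_token>, map unknown tokens to <unk_token>, pad to max_len."""
    ids = [vocab["<sos_token>"]] + [vocab.get(t, vocab["<unk_token>"]) for t in seq]
    ids += [vocab["<pad_token>"]] * (max_len - len(ids))
    return ids[:max_len]

vocab = build_vocab([["PCA", "whiten=True"], ["KNN", "n_neighbors=5"]])
print(encode(["PCA", "dropout"], vocab, max_len=6))  # [2, 3, 1, 0, 0, 0]
```

Here `dropout` is not in the vocabulary and therefore maps to `<unk_token>`, and the sequence is padded with `<pad_token>` ids up to the fixed length.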

Table 1. Transformer parameters.

Finally, the loss function measures the difference between the model's predictions and the actual tokens, while the LabelSmoothing strategy helps make the model less confident and more general. The loss is normalized by the number of non-padding tokens to ensure fair evaluation across sequences of varying lengths.
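The loss described above can be sketched as follows. This is an illustrative NumPy reimplementation under our own assumptions, not the framework code used in the experiments:

```python
import numpy as np

# Illustrative label-smoothed cross-entropy, normalized by non-padding tokens.
def smoothed_loss(logits, targets, pad_id=0, smoothing=0.1):
    n_classes = logits.shape[-1]
    # Softmax over the vocabulary dimension (numerically stabilized).
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Smoothed target distribution: 1 - eps on the true token, eps spread elsewhere.
    smooth = np.full_like(probs, smoothing / (n_classes - 1))
    smooth[np.arange(len(targets)), targets] = 1.0 - smoothing
    ce = -(smooth * np.log(probs + 1e-12)).sum(axis=-1)
    mask = targets != pad_id            # ignore padding positions
    return ce[mask].sum() / mask.sum()  # normalize by non-padding tokens

logits = np.array([[2.0, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 0.1]])
targets = np.array([1, 1, 0])  # last position is padding (pad_id = 0)
print(smoothed_loss(logits, targets))
```

Normalizing by `mask.sum()` rather than the full sequence length is what keeps the loss comparable across sequences with different amounts of padding.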

3.3 Extracting Embeddings from Network Layers

The encoding function learned by the Transformer in the previous step generates embeddings for each token in the input sequence. These embeddings are extracted from the final encoder layer, after passing through the attention and feed-forward sublayers and undergoing normalization. To create the complete tree embedding, it is necessary to aggregate the individual mapped tokens to obtain the full tree representation. This can be done using three aggregation functions:

  • Sum: summing the token embeddings element-wise across all tokens, producing a vector with the original embedding dimension;

  • Mean: averaging the token embeddings element-wise across all tokens;

  • Concat: concatenating the embeddings of all tokens into a single, longer vector.

The three aggregation functions proposed above will be tested to determine the most appropriate one.
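The three aggregation functions can be sketched as follows; the token matrix is a hypothetical example, with one row per token from the final encoder layer:

```python
import numpy as np

# Sketch of the three aggregation functions over per-token embeddings.
# `tokens` is a (sequence_length, embedding_dim) matrix.
def aggregate(tokens, how="sum"):
    if how == "sum":      # element-wise sum across tokens -> (embedding_dim,)
        return tokens.sum(axis=0)
    if how == "mean":     # element-wise mean across tokens -> (embedding_dim,)
        return tokens.mean(axis=0)
    if how == "concat":   # all token vectors in sequence -> (seq_len * dim,)
        return tokens.reshape(-1)
    raise ValueError(how)

tokens = np.array([[1.0, 2.0], [3.0, 4.0]])
print(aggregate(tokens, "sum"))     # [4. 6.]
print(aggregate(tokens, "mean"))    # [2. 3.]
print(aggregate(tokens, "concat"))  # [1. 2. 3. 4.]
```

Note that sum and mean yield fixed-size embeddings regardless of sequence length, whereas concat grows with the number of tokens.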

4 Generating Tree Embeddings

Given the proposed methodology, the next step was to train the proposed Transformer to generate tree representations. The training sets were built from three datasets, listed in Table 2. They are available on Kaggle and the UCI Machine Learning Repository, and were selected considering their numbers of attributes and classes.

Table 2. Datasets characteristics.

For each dataset, the trees previously defined in Sect. 3.1 are used as input to train the transformer. The only differences in tree representation are in the parameters that depend on the input variables, such as the number of features in the SelectKBest algorithm. For this type of parameter, relative values were set.

Having the datasets, we tuned the number of embedding dimensions (16, 32, 64, 128 and 256) and learning rate (0.001 and 0.0001), while other parameters were kept with their default values presented in Table 3.

The network was trained for up to 50 epochs, with an early stopping mechanism: if the loss did not decrease by more than a predefined threshold of \(10^{-8}\) for five consecutive epochs, the model was considered to have converged and training was stopped.
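The early-stopping rule can be sketched as below (an illustrative implementation of the criterion just described, not the exact training-loop code):

```python
# Stop when the loss fails to improve by more than `threshold`
# for `patience` consecutive epochs.
def should_stop(losses, threshold=1e-8, patience=5):
    if len(losses) <= patience:
        return False
    recent = losses[-(patience + 1):]
    # True only if no epoch in the window improved on its predecessor
    # by more than the threshold.
    return all(prev - cur <= threshold for prev, cur in zip(recent, recent[1:]))

print(should_stop([1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))  # True: five flat epochs
print(should_stop([1.0, 0.8, 0.6, 0.4, 0.2, 0.1]))       # False: still improving
```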

Table 3 presents the configuration and test set loss for each trained neural network. We ran each model five times to generate the variance and confidence interval.

Table 3. Transformer configuration and test loss for each model.

Figure 2 shows the evolution per epoch of the loss function of each model that resulted from the best configuration. The training loss and the test loss decrease rapidly within the first few epochs, which indicates that the model is learning quickly and effectively adjusting its parameters early in the training process.

Fig. 2.
figure 2

Learning curve.

5 Contrasting Tree-Based and Embedded Representations

One of the difficulties of the proposed methodology is evaluating the embedding representations against the tree representation. We start by looking at the correlations between distances computed on the tree and embedding representations.

The traditional way to calculate distances between tree representations is to use the tree edit distance, defined in Equation (1):

$$\begin{aligned} D(T_1, T_2) = {\left\{ \begin{array}{ll} 0 & \text {if } T_1 = T_2\\ \min {\left\{ \begin{array}{ll} D(T_1 - l_1, T_2 - l_2) + \text {cost}(l_1, l_2)\\ D(T_1 - l_1, T_2)\\ D(T_1, T_2 - l_2)\\ \end{array}\right. } & \text {otherwise.} \end{array}\right. } \end{aligned}$$
(1)

The distances between the embeddings will be measured using two metrics: cosine distance (1 − cosine similarity) and Euclidean distance. The goal is to assess whether the embeddings are suitable representations for the pipelines: a high Pearson or Spearman correlation between the distances in the two representations provides evidence of the suitability of using embeddings for AutoML.
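For reference, the two embedding distances can be computed as in the minimal NumPy sketch below:

```python
import numpy as np

# The two embedding distances used in the comparison.
def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    # 1 - cosine similarity, as described above.
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(euclidean(a, b))        # 1.4142... (sqrt(2))
print(cosine_distance(a, b))  # 1.0: orthogonal vectors
```

Both have linear cost in the embedding dimension, in contrast with the much higher cost of tree edit distance.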

Additionally, as the embeddings are encoded in high-dimensional spaces, we use UMAP [7] to reduce their dimensionality and plot the solutions in a two-dimensional space, allowing the visualization of the search space and a qualitative analysis of the relationships between pipelines. Both are reported in the next section.

To calculate the distances between solutions using tree edit distance metrics, Euclidean distance and cosine distance, five samples with 5000 trees each were collected through random and stratified sampling to ensure diversity and representativeness in the sampled population. The use of samples is justified by the complexity of calculating tree edit distances [1, 2].
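A stratified sampler in this spirit might look like the sketch below; the grouping key, counts, and data layout are illustrative assumptions, not the exact procedure used in the experiments:

```python
import random

# Illustrative stratified sampler: draw the same number of pipelines from each
# stratum (here, the classification algorithm) to keep samples representative.
def stratified_sample(pipelines, key, n, seed=0):
    rng = random.Random(seed)
    strata = {}
    for p in pipelines:
        strata.setdefault(key(p), []).append(p)
    per_stratum = max(1, n // len(strata))
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# Hypothetical population: 30 pipelines over three classifiers.
pipelines = [{"clf": c, "id": i} for i, c in enumerate(["RF", "KNN", "MLP"] * 10)]
subset = stratified_sample(pipelines, key=lambda p: p["clf"], n=9)
print(len(subset))  # 9: three pipelines from each of the three strata
```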

Figures 3 and 4 present the Spearman correlations between tree edit distance and the embedding distances. The Pearson correlations are omitted due to space restrictions; Spearman presented better results, which indicates that the relation between tree edit distance and the other metrics might be monotonic but not necessarily linear.
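A small self-contained illustration of why Spearman can exceed Pearson on a monotonic but nonlinear relation (pure Python, ties ignored for simplicity; in practice a library routine such as SciPy's would be used):

```python
# Spearman correlation is Pearson correlation computed on ranks, so it reaches
# 1.0 for any strictly monotonic relation, even a nonlinear one.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))  # assumes no ties

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 3 for x in xs]  # monotonic but strongly nonlinear
print(spearman(xs, ys))    # 1.0: perfect monotonic agreement
print(pearson(xs, ys))     # < 1.0: the relation is not linear
```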

Results showed a strong correlation between embedding distances and tree edit distances, especially in models without tags and using "sum" as the aggregation function. Euclidean distances consistently outperformed cosine distances.

The ML-Prove and Raisin datasets exhibited the highest correlations, ranging from 0.73 to 0.79, with the ML-Prove model (64 dimensions, learning rate 0.0001) performing better. The Mushrooms dataset showed lower correlations but still reached values between 0.66 and 0.71 in certain configurations. In summary:

  • Spearman Correlation with Euclidean Distance: Spearman correlations were slightly higher, with no-tag models using mean and sum functions achieving correlations up to 0.78. The Mushrooms dataset showed significantly lower correlations.

  • Spearman Correlation with Cosine Distance: Results were slightly inferior to Euclidean distance, with maximum correlations of 0.79 for no-tag models in ML-Prove. The Mushrooms dataset also showed lower correlations.

Fig. 3.
figure 3

Average across samples of Spearman correlations for Euclidean distance.

Fig. 4.
figure 4

Average across samples of Spearman correlations for cosine distance.

6 Visual Analysis of the Search Space

As previously mentioned, the embeddings also underwent a dimensionality reduction process to enable plotting them on a Cartesian plane and generating a coordinate chart. UMAP was trained using a greedy algorithm that optimized the number of neighbours, minimum distance and distance metric.

Based on the correlation results, only the plots generated with the mean and sum aggregation functions under the no-tag strategy are shown. Additionally, the ML-Prove and Mushrooms datasets were selected for visualization, as they demonstrated the best and worst performance, respectively. The objective is to investigate how the visualizations behave on datasets with different performance levels, assessing the potential for semantic preservation in scenarios of high (0.78) and lower (0.66) correlations.

In the coordinate plot in Fig. 5, each point represents a pipeline, and the set of points is the search space. Since embeddings have the potential to preserve the semantic aspects of the data they represent, similar pipelines should be closer together in this space, while different pipelines should be farther apart. The following analysis highlights this potential for semantics preservation, as pipelines with similar algorithms and hyperparameters are grouped closely together.

Figure 5a shows the visualization of the ML-Prove search space. Note that similar algorithms were clustered together. Random Forest and K-Nearest Neighbors display a clear boundary of separation in the search space, while the other algorithms - MLP, Logistic Regression and Ada Boost - are concentrated in the lower right region. Furthermore, a cluster for each of these algorithms is identifiable, although Logistic Regression is slightly more spread out than the other two. Since the search space is predominantly composed of Random Forest and K-Nearest Neighbors pipelines, it is consistent with the results that the Transformer acquired more information about these algorithms and therefore groups them more accurately.

Figure 5b presents the search space of the Mushrooms dataset. Although this model showed the worst correlation results, an analysis of the images reveals acceptable algorithm clustering, with the locations of each algorithm in the space being easily visualized. Again, Random Forest and K-Nearest Neighbors occupy most of the space and lie in the most identifiable regions. However, the model also managed to group the other classification algorithms into perceptible regions. Comparing this visualization with the previous one, it is noteworthy that this model was better at grouping identical algorithms into smaller, more concise regions, revealing similarity by both algorithm and parameters.

Fig. 5.
figure 5

Search space by classification algorithm.

The search space divided by preprocessing algorithm is presented in Fig. 6. It is worth highlighting that this type of algorithm was not the best choice for primarily clustering the pipelines, but it can still be useful when analyzed together with the classification-algorithm plot (Fig. 5). As observed with the classification algorithms, the Mushrooms model better separated the pipelines into groups.

Fig. 6.
figure 6

Search space grouped by preprocessing algorithm.

6.1 Investigating Clustering

Two types of clustering are worth investigating: regions where the model incorrectly grouped the pipelines, such as the lower right region in the ML-Prove dataset, and regions well-separated, such as the clustering observed in the Mushrooms search space.

When analyzing Fig. 5a, observe that the Logistic Regression, MLP, and AdaBoost algorithms appear in locations where they should not be, mixed with the KNN region in the lower right part of the search space. Analyzing Fig. 6a, notice that the preprocessing algorithm also fails to differentiate the pipelines. Although the clustering is not pure, the model still managed to separate the groups to some extent. One possible reason for the poor grouping of these three classification algorithms could be the number of pipelines that use them: with less training data on these three, the model has less information and does not learn to classify them as well.

Another important investigation is to analyze the composition of clusters. It is expected that subgroups are composed of pipelines with similar algorithms and hyperparameters. Figure 7a provides a closer look at a region within one of the Random Forest clusters. In this figure, we observe sub-clusters with similar preprocessing parameters and algorithms:

  • Preprocessing algorithm: PCA with 8, 15 and 21 principal components;

  • Classification algorithm: Random Forest with entropy as the criterion and sqrt as the max_features hyperparameter.

Figure 7b presents another group with similar hyperparameters:

  • Preprocessing algorithm: Standard Scaler;

  • Classification algorithm: Random Forest with gini as the criterion and sqrt as the max_features hyperparameter.

Fig. 7.
figure 7

Search space by preprocessing algorithm.

6.2 Qualitative Analysis

Finally, to provide a qualitative analysis of the embedding distances, a set of four pipelines from the best and worst-performing models - as done in the previous subsection - was selected for this experiment. The goal is to choose a pair of pipelines that appear similar and a pair of very different pipelines and then investigate their distances and locations in the search space. Table 4 presents these results for the ML-Prove dataset.

Note that the similar pairs are closer than the different ones, which is expected. The similar pairs use the same preprocessing algorithm with different parameters and the same classification algorithm with different parameters.

Table 4. Qualitative analysis for ML-Prove

For the Mushrooms dataset, results are shown in Table 5. As expected, the similar pair is closer than the different one. The comparison between the sum without the tag model and the mean with the tag model indicates that the former better captures the structures and semantics of the pipelines, as evidenced by the greater distance between the pipelines. Nevertheless, the second model can satisfactorily identify distance relationships.

Table 5. Qualitative analysis for Mushrooms.

Figure 8 illustrates the positions of each pair of pipelines. The points were jittered to disperse overlapping regions. The visual analysis suggests that UMAP successfully preserved the high-dimensional structure and mapped it into a two-dimensional space. However, it is worth noting that the Euclidean distance between the worst and best pipelines is greater in the ML-Prove dataset than in the Mushrooms dataset, yet, in the plot, it appears to be smaller.

Fig. 8.
figure 8

Location of each pair of pipelines. Stars represent similar pipelines and circles mark different ones.

7 Conclusion

This paper developed and evaluated a novel linear representation for AutoML pipelines. By transforming traditional tree-based representations into embeddings, we addressed two challenges inherent in the canonical method: the computational complexity of calculating distances between solutions and the preservation of semantics. The experimental results indicate that different embedding models exhibit strong correlations with traditional tree edit distances, suggesting that embeddings can effectively capture the underlying relationships between different pipeline configurations.

The visualizations of the search space validate the potential of embeddings as a suitable representation for pipelines. Analysis of the plots reveals distinct algorithm regions and clusters and qualitative analysis shows that the embeddings’ distances align with the apparent pipeline distances.

Future work will focus on further studies regarding embeddings, such as investigating the preservation of local optima networks (LONs) and using neighborhood plots to explore these relationships more deeply.

Overall, this research contributes to the advancement of AutoML by proposing a robust, efficient, and semantically meaningful representation for pipelines, enhancing the explainability of the optimization process and paving the way for more effective and transparent AutoML solutions.