Task-Adaptive Neural Network Search with Meta-Contrastive Learning
Wonyong Jeong, Hayeon Lee, Gun Park, Eunyoung Hyung, Jinheon Baek, Sung Ju Hwang
Date: 2021-03-02

Most conventional Neural Architecture Search (NAS) approaches are limited in that they only generate architectures without searching for the optimal parameters. While some NAS methods handle this issue by utilizing a supernet trained on a large-scale dataset such as ImageNet, they may be suboptimal if the target tasks are highly dissimilar from the dataset the supernet is trained on. To address such limitations, we introduce a novel problem of Neural Network Search (NNS), whose goal is to search for the optimal pretrained network for a novel dataset and constraints (e.g., number of parameters) from a model zoo. Then, we propose a novel framework to tackle the problem, namely Task-Adaptive Neural Network Search (TANS). Given a model zoo that consists of networks pretrained on diverse datasets, we use a novel amortized meta-learning framework to learn a cross-modal latent space with a contrastive loss, which maximizes the similarity between a dataset and a network that performs well on it, and minimizes the similarity between irrelevant dataset-network pairs. We validate the effectiveness and efficiency of our method on ten real-world datasets against existing NAS/AutoML baselines. The results show that our method instantly retrieves networks that outperform models obtained with the baselines, with significantly fewer training steps to reach the target performance, thus minimizing the total cost of obtaining a task-optimal network. Our code and the model zoo are available at https://github.com/wyjeong/TANS.

Neural Architecture Search (NAS) aims to automate the design process of network architectures by searching for high-performing architectures with RL [76, 77], evolutionary algorithms [43, 11], parameter sharing [6, 42], or surrogate schemes [38], to overcome the excessive cost of trial-and-error approaches with the manual design of neural architectures [47, 23, 27]. Despite their success, existing NAS methods suffer from several limitations, which hinder their applicability to practical scenarios. First of all, the search for the optimal architecture usually requires a large amount of computation, which can take multiple GPU hours or even days to finish. This excessive computation cost makes it difficult to efficiently obtain an optimal architecture for a novel dataset. Secondly, most NAS approaches only search for optimal architectures, without considering their parameter values. Thus, they require extra computation and time for training on the new task, in addition to the architecture search cost, which is already excessively high. For this reason, supernet-based methods [8, 37] that search for a sub-network (subnet) from a network pretrained on large-scale data are attracting increasing popularity, as they eliminate the need for additional training.

Then, we train our retrieval model via amortized meta-learning of a cross-modal latent space with a contrastive learning objective. Specifically, we encode each dataset with a set encoder and obtain functional and topological embeddings of a network, such that a dataset is embedded closer to the network that performs well on it while minimizing the similarity between irrelevant dataset-network pairs.
The learning process is further guided by a performance predictor, which predicts the model's performance on a given dataset. The proposed Task-Adaptive Network Search (TANS) largely outperforms conventional NAS/AutoML methods (see Figure 2), while significantly reducing the search time. This is because the retrieval of a trained network can be done instantly without any additional architecture search cost, and retrieving a task-relevant network further reduces the fine-tuning cost. To evaluate the proposed TANS, we first demonstrate the sample-efficiency of our model-zoo construction method over construction with random sampling of dataset-network pairs. Then, we show that TANS can adaptively retrieve the best-fitted models for an unseen dataset. Finally, we show that our method significantly outperforms baseline NAS/AutoML methods on real-world datasets (Figure 2), with incomparably smaller computational cost to reach the target performance. In sum, our main contributions are as follows:

• We consider a novel problem of Neural Network Search, whose goal is to search for the optimal network for a given task, including both the architecture and the parameters.
• We propose a novel cross-modal retrieval framework to retrieve a pretrained network from the model zoo for a given task via amortized meta-learning with a contrastive objective.
• We propose an efficient model-zoo construction method to construct an effective database of dataset-architecture pairs, considering both the model performance and task diversity.
• We train and validate TANS on a newly collected large-scale database, on which our method outperforms all NAS/AutoML baselines with almost no architecture search cost and significantly fewer fine-tuning steps.

Neural Architecture Search Neural Architecture Search (NAS), which aims to automate the design of neural architectures, is an active topic of research. Earlier NAS methods use non-differentiable search techniques based on RL [76, 77] or evolutionary algorithms [43, 11]. However, their excessive computational requirements [44] in the search process limit their practical applicability in resource-limited settings. To tackle this challenge, one-shot methods share parameters [42, 6, 35, 65] among architectures, which reduces the search cost by orders of magnitude. Surrogate schemes predict the performance of architectures without directly training them [38, 75, 54], which also cuts down the search cost. Latent space-based NAS methods [38, 54, 67] learn latent embeddings of the architectures to reconstruct them for a specific task. Recently, supernet-based approaches, such as OFA [8], have received the most attention due to their high performance. OFA generates a subnet with its parameters by splitting the trained supernet. While this eliminates the need for costly re-training of each searched architecture from scratch, it only trains a fixed supernet on a single dataset (ImageNet-1K), which limits its effectiveness on diverse tasks that are largely different from the training set. In contrast, our TANS task-adaptively retrieves a trained neural network from a database of networks with varying architectures trained on diverse datasets.

Meta-Learning The goal of meta-learning [55] is to learn a model that generalizes over a distribution of tasks, instead of instances from a single task, such that a meta-learner trained across multiple tasks can rapidly adapt to a novel task.
While most meta-learning methods consider few-shot classification with a fixed architecture [56, 20, 48, 40, 33, 30], there are a few recent studies that couple NAS with meta-learning [46, 34, 17] to search for a well-fitted architecture for the given task. However, these NAS approaches are limited to small-scale tasks due to the cost of roll-out gradient steps. To tackle this issue, MetaD2A [31] proposes to generate task-dependent architectures with amortized meta-learning, but it does not consider parameters for the searched architecture, and thus requires the additional cost of training it on unseen datasets. To overcome these limitations, our method retrieves the best-fitted architecture with its parameters for the target task, by learning a cross-modal latent space for dataset-network pairs with amortized meta-learning.

Neural Retrieval Neural retrieval aims to search for and return the best-fitted item for a given query, by learning to embed items in a latent space with a neural network. Such approaches can be broadly classified into models for image retrieval [21, 14, 66] and text retrieval [73, 9, 63]. Cross-modal retrieval approaches [32, 74, 58] handle retrieval across different modalities of data (e.g., image and text), by learning a common representation space to measure the similarity across instances from different modalities. To our knowledge, none of the existing works is directly related to our approach, which performs cross-modal retrieval of neural networks given datasets.

We first construct our model zoo with Pareto-optimal dataset-network pairs, rather than exhaustively training all possible pairs. We then embed a model and a dataset with a graph-functional model encoder and a set encoder, respectively. After that, we meta-learn the cross-modal retrieval network over multiple model-query pairs, guided by our performance predictor.

Task-Adaptive Neural Network Retrieval To learn a cross-modal latent space for dataset-network pairs over a task distribution, we first introduce a novel task-adaptive neural network retrieval problem. The goal of task-adaptive retrieval is to find an appropriate network M^τ given the query dataset D^τ for task τ. To this end, we need to calculate the similarity between the dataset-network pair (D^τ, M^τ) ∈ Q × M with a scoring function f that outputs the similarity between them as follows:

    f(D^τ, M^τ) = f_sim(E_Q(D^τ; θ), E_M(M^τ; φ)),    (1)

where E_Q : Q → R^d is a query (dataset) encoder, E_M : M → R^d is a model encoder, parameterized by θ and φ respectively, and f_sim : R^d × R^d → R is a scoring function for the query-model pair. In this way, we can construct the cross-modal latent space for dataset-network pairs over the task distribution with Equation 1, and use this space to rapidly retrieve a well-fitted neural network in response to an unseen query dataset. We could learn such a cross-modal latent space of dataset-network pairs for rapid retrieval by directly solving the above objective, assuming that we have the query and model encoders E_Q and E_M. However, we further propose a contrastive loss to maximize the similarity between a dataset and a network that obtains high performance on it in the learned latent space, and minimize the similarity between irrelevant dataset-network pairs, inspired by Faghri et al. [19] and Engilberge et al. [18]. While existing works such as Faghri et al. [19] and Engilberge et al.
[18] target image-to-text retrieval, we tackle the problem of cross-modal retrieval across datasets and networks, which is nontrivial as it requires task-level meta-learning.

Retrieval with Meta-Contrastive Learning Our meta-contrastive learning objective for each task τ ∈ p(τ), consisting of a dataset-model pair (D^τ, M^τ) ∈ Q × M, aims to maximize the similarity between positive pairs, f_sim(q, m^+), while minimizing the similarity between negative pairs, f_sim(q, m^−), where m^+ is obtained from the sampled target task τ ∈ p(τ) and m^− is obtained from other tasks γ ∈ p(τ), γ ≠ τ, as illustrated in Figure 3. This meta-contrastive learning objective can be formally defined as follows:

    min_{θ,φ} E_{τ∼p(τ)} [ L^m(τ; θ, φ) ].    (2)

We then introduce L^m for the meta-contrastive learning:

    L^m(τ; θ, φ) = Σ_{γ∈p(τ), γ≠τ} max(0, α + f_sim(q^τ, m^γ) − f_sim(q^τ, m^τ)),    (3)

where α ∈ R is a margin hyper-parameter and the score function f_sim is the cosine similarity. The contrastive loss promotes the positive (q, m^+) embedding pair to be close together, at least a margin α closer than the negative (q, m^−) embedding pairs in the learned cross-modal metric space. Note that, similarly, we also contrast each model with its corresponding query, which yields the query contrastive loss L^q; we describe this in the supplementary material in detail. With the above ingredients, we minimize the meta-contrastive learning loss over a task distribution p(τ), defined with the model (L^m) and query (L^q) contrastive losses, as follows:

    min_{θ,φ} E_{τ∼p(τ)} [ L^m(τ; θ, φ) + L^q(τ; θ, φ) ].    (4)

Meta-Performance Surrogate Model We propose a meta-performance surrogate model to predict the performance on an unseen dataset without directly training on it, which is highly practical in real-world scenarios since it is expensive to iteratively train models on every dataset to measure their performance. Thus, we meta-train a performance surrogate model a = S(τ; ψ) over a distribution of tasks p(τ) on the model-zoo database. This model not only accurately predicts the performance a of a network M^τ on an unseen dataset D^τ, but also guides the learning of the cross-modal retrieval space, thus embedding a neural network closer to the datasets that it performs well on. Specifically, the proposed surrogate model S takes a query embedding q^τ and a model embedding m^τ as inputs for the given task τ, and then forwards them to predict the accuracy of the model for the query. We train this performance predictor S(τ; ψ) to minimize the mean-squared error loss L^s(τ; ψ) = (s^τ_acc − S(τ; ψ))^2 between the predicted accuracy S(τ; ψ) and the true accuracy s^τ_acc of the model on each task τ, which is sampled from the task distribution p(τ). Then, we combine this objective with the retrieval objective in Equation 4 to train the entire framework as follows:

    min_{θ,φ,ψ} E_{τ∼p(τ)} [ L^m(τ; θ, φ) + L^q(τ; θ, φ) + λ L^s(τ; ψ) ],    (5)

where λ is a hyper-parameter for weighting the losses. By leveraging the meta-learned cross-modal retrieval space, we can instantly retrieve the best-fitted pretrained network M ∈ M given an unseen query dataset D̃ ∈ Q̃, which is disjoint from the meta-training datasets D ∈ Q. Equipped with the meta-training components described in the previous subsection, we now describe the details of our model at inference time, which include the following: amortized inference, performance prediction, and task- and constraints-adaptive initialization.

Amortized Inference Most existing NAS methods are slow as they require several GPU hours of training to find the optimal architecture for a dataset D̃.
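Before turning to inference, here is a minimal PyTorch sketch of the meta-training objective in Equations (2)-(5) above. It assumes a batch where the i-th query embedding is paired with the i-th model embedding and treats all other rows as negatives; the names (`margin`, `lam`) and the batch-wise negative sampling are illustrative simplifications, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def meta_contrastive_loss(q, m, margin=0.4):
    """Hinge-based contrastive loss over a batch of dataset-network pairs.

    q: (B, d) query (dataset) embeddings, m: (B, d) model embeddings,
    where (q[i], m[i]) is a positive pair and all other rows act as negatives.
    """
    q = F.normalize(q, dim=-1)
    m = F.normalize(m, dim=-1)
    sim = q @ m.t()                        # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)          # f_sim(q^tau, m^tau)
    neg_mask = ~torch.eye(len(q), dtype=torch.bool, device=q.device)
    # L^m: negatives are models trained for other tasks gamma != tau
    loss_m = F.relu(margin + sim - pos)[neg_mask].mean()
    # L^q: symmetric term contrasting each model against other queries
    loss_q = F.relu(margin + sim.t() - pos)[neg_mask].mean()
    return loss_m + loss_q

def total_loss(q, m, pred_acc, true_acc, lam=1.0, margin=0.4):
    """Combined objective: contrastive retrieval loss + lambda * predictor MSE."""
    l_s = F.mse_loss(pred_acc, true_acc)
    return meta_contrastive_loss(q, m, margin) + lam * l_s
```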
In contrast, the proposed Task-Adaptive Network Search (TANS) only requires a single forward pass per dataset to obtain a query embedding q̃ for the unseen dataset, using the query encoder E_Q(D̃; θ*) with the meta-trained parameters θ*, since we train our model with amortized meta-learning over a distribution of tasks p(τ). After obtaining the query embedding, we retrieve the best-fitted network M* for the query based on the similarity:

    M* = argmax_{M^τ} f_sim(q̃, m^τ),    (6)

where the set of model embeddings {m^τ | τ ∈ p(τ)} is pre-computed by the meta-trained model encoder E_M(M^τ; φ*).

Performance Prediction While we can achieve competitive performance on an unseen dataset with the retrieval model alone, we also use the meta-learned performance predictor S to select the best-performing network among the top-K candidate networks {M_i}_{i=1}^K based on their predicted performance. Since this surrogate model, which includes a module that takes the dataset into account, is meta-learned over the distribution of tasks p(τ), we can predict the performance on an unseen dataset D̃ without training on it. This is different from conventional surrogate models [38, 75, 54], which additionally need to be trained on an unseen dataset from scratch to predict the performance on it.

Task-adaptive Initialization Given an unseen dataset, the proposed TANS can retrieve a network trained on a training dataset that is highly similar to the unseen query dataset from the model zoo (see Figure 4). Therefore, the fine-tuning time of the retrieved network on the unseen target dataset D̃ is effectively reduced, since the retrieved network M has task-relevant initial parameters that are already trained on a similar dataset. If we further need to consider constraints s, such as the number of parameters or FLOPs, we can easily check whether the retrieved models meet the specific constraints by sorting them in descending order of their scores and then selecting the best-accuracy model that satisfies the constraints.

Query Encoder The goal of the proposed query encoder E_Q(D; θ) : Q → R^d is to embed a dataset D as a single query vector q in the cross-modal latent space. Since each dataset D consists of n data instances, D = {X_i}_{i=1}^n ∈ Q, we need to satisfy permutation invariance over the data instances X_i, to output a consistent representation regardless of the order of the instances. To satisfy this condition, we first individually transform n randomly sampled instances of the dataset D with a continuous learnable function ρ, and then apply a pooling operation to obtain the query vector q = Σ_{X_i ∈ D} ρ(X_i), adopting Zaheer et al. [69].

Model Encoder To encode a neural network M^τ, we consider both its architecture and the model parameters trained on the dataset D^τ for each task τ. Thus, we propose to generate a model embedding with two encoding functions: 1) topological encoding and 2) functional encoding. Following Cai et al. [8], we first obtain a topological embedding v^τ_t from auxiliary information about the architecture topology, such as the number of layers, channel expansion ratios, and kernel sizes. Our next goal is then to encode the trained model parameters for the given task, to account for the parameters in addition to the architecture. However, a major problem here is that directly encoding millions of parameters into a vector is highly challenging and inefficient. To this end, we use a functional embedding, which embeds a network solely based on its input-output pairs.
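A minimal sketch of this model encoding is given below, combining the topological vector with a functional embedding obtained by passing a fixed Gaussian noise tensor through the trained network (the functional embedding is detailed in the next paragraph). The noise shape, the assumption that the network's output is already `func_dim`-dimensional, and the single projection layer are illustrative choices, not the exact released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModelEncoder(nn.Module):
    """Sketch of the two-part model encoding: topology vector + functional embedding.

    The dimensions (45-d topology, 1536-d functional output, 128-d latent) follow the
    appendix description; everything else here is an assumption for illustration.
    """
    def __init__(self, topo_dim=45, func_dim=1536, latent_dim=128):
        super().__init__()
        self.proj = nn.Linear(topo_dim + func_dim, latent_dim)
        # A single fixed Gaussian noise batch shared across all networks.
        self.register_buffer("noise", torch.randn(1, 3, 224, 224))

    def functional_embedding(self, net):
        # Feed the SAME fixed noise through every trained network; different
        # architectures/parameters define different functions, so outputs differ.
        # Assumes net(self.noise) returns a (1, func_dim) feature/logit tensor.
        with torch.no_grad():
            out = net(self.noise)
        return out.flatten(1)

    def forward(self, net, topo_vec):
        v_f = self.functional_embedding(net)       # functional embedding v_f
        v_t = topo_vec.view(1, -1).float()         # topological embedding v_t
        m = self.proj(torch.cat([v_t, v_f], dim=-1))
        return F.normalize(m, dim=-1)              # L2-normalized model embedding
```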
The functional embedding is generated by feeding a fixed Gaussian random noise into each trained network M^τ and then taking its output v^τ_f. The intuition behind the functional embedding is straightforward: since networks with different architectures and parameters define different functions, they produce different outputs for the same input. With the two encoding functions, the proposed model encoder generates the model representation by concatenating the topology and function embeddings [v^τ_t, v^τ_f] and then transforming the concatenated vector with a non-linear function σ as follows: m^τ = σ([v^τ_t, v^τ_f]). Note that the two encoding functions satisfy the injectiveness property under certain conditions, which helps with the accurate retrieval of the embedded elements in a condensed continuous latent space. We provide the proof of the injectiveness of the two encoding functions in Section C of the supplementary file.

Given a set of datasets D = {D_1, ..., D_K} and a set of architectures M = {M_1, ..., M_N}, the most straightforward way to construct a model zoo Z is to train all architectures on all datasets, which yields a model zoo Z that contains N × K pretrained networks. However, we may reduce the construction cost by collecting P dataset-model pairs {(D, M)}_{i=1}^P, where P ≪ N × K, by randomly sampling an architecture M ∈ M and then training it on D ∈ D. Although this works well in practice (see Figure 8 (Bottom)), we further propose an efficient algorithm to construct the zoo in a more sample-efficient manner, by skipping the evaluation of dataset-model pairs that are certainly worse than others in all aspects (memory consumption, latency, and test accuracy). We start with an initial model zoo Z^(0) that contains a small number of randomly selected pairs and their test accuracies. Then, at each iteration t, among the set of candidates C^(t), we find a pair {D, M} that can expand the currently known set of all Pareto-optimal pairs w.r.t. all conditions (memory, latency, and test accuracy on the dataset D), based on the amount of Pareto-front expansion estimated by f_zoo(·; Z^(t)), which relies on an accuracy predictor with parameters ψ_zoo trained on Z^(t); the function g_D measures the volume under the Pareto curve, also known as the Hypervolume Indicator [41], for the dataset D. We then train M on D and add it to the current model zoo Z^(t). For the full algorithm, please refer to Appendix A.

In this section, we conduct extensive experimental validation against conventional NAS methods and commercially available AutoML platforms, to demonstrate the effectiveness of our proposed method.

Datasets We collect 96 real-world image classification datasets from Kaggle (https://www.kaggle.com/). We then divide the datasets into two non-overlapping sets of 86 meta-training and 10 meta-test datasets. As some datasets contain a relatively larger number of classes than the others, we adjust each dataset to have up to 20 classes, yielding 140 and 10 datasets for meta-training and meta-testing, respectively (please see Table 5 for the detailed dataset configuration). For each dataset, we randomly sample 80%/20% of the instances as the training and test sets. Specifically, our 10 meta-test datasets include Colorectal Histology, Drawing, Dessert, Chinese Characters, Speed Limit Signs, Alien vs Predator, COVID-19, Gemstones, and Dog Breeds.
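The efficient construction procedure of Section 3.3 can be summarized with the following simplified sketch. For readability it uses only two objectives (accuracy vs. normalized parameter count) instead of the three in the paper (memory, latency, accuracy), and `predict_acc`, `train_and_eval`, and `norm_params` are assumed placeholder helpers; the paper computes the actual hypervolume with PyGMO.

```python
def hypervolume_2d(points):
    """Area dominated by (cost, acc) points w.r.t. the reference (cost=1, acc=0).

    Each point dominates the rectangle [cost, 1] x [0, acc]; we integrate the
    upper envelope max{acc_i : cost_i <= x} over x in [0, 1].
    """
    pts = sorted(points)                    # ascending normalized cost
    area, best_acc = 0.0, 0.0
    for i, (cost, acc) in enumerate(pts):
        nxt = pts[i + 1][0] if i + 1 < len(pts) else 1.0
        best_acc = max(best_acc, acc)
        area += (nxt - cost) * best_acc
    return area

def expand_zoo(zoo, candidates, predict_acc, train_and_eval, norm_params, steps=100):
    """Greedy model-zoo expansion (simplified 2-objective sketch of Algorithm 1).

    zoo: dict dataset -> list of (normalized_param_count, true_accuracy)
    candidates: list of (dataset, model) pairs not yet evaluated
    """
    for _ in range(steps):
        def gain(pair):
            d, m = pair
            cur = hypervolume_2d(zoo[d])
            new = hypervolume_2d(zoo[d] + [(norm_params(m), predict_acc(d, m))])
            return new - cur                # estimated Pareto-front expansion
        d, m = max(candidates, key=gain)
        zoo[d].append((norm_params(m), train_and_eval(d, m)))  # evaluate for real
        candidates.remove((d, m))
    return zoo
```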
We strictly verified that there is no dataset-, class-, or instance-level overlap between the meta-training and the meta-test datasets, although some datasets may contain semantically similar classes.

Baseline Models We consider MobileNetV3 [26] pretrained on ImageNet as our baseline neural network. We compare our method with conventional NAS methods, such as PC-DARTS [65] and DrNAS [10], weight-sharing approaches, such as FBNet [60] and Once-For-All [8], and a data-driven meta-NAS approach, MetaD2A [31]. All these NAS baselines are based on MobileNetV3 pretrained on ImageNet, except for the conventional NAS methods, which are only able to generate architectures. As these conventional NAS methods start from scratch, we train them for a sufficient number of epochs (10 times more training steps) for a fair comparison. Please see Appendix B for further details of the experimental setup.

We follow the OFA search space [8], which allows us to design resource-efficient architectures, and thus we obtain network candidates from the OFA space. We sample 100 network candidates per meta-training dataset and then train the network-dataset pairs, yielding 14,000 dataset-network pairs in our full model-zoo. We also construct smaller, efficient model-zoos from the full model-zoo (14,000) with our efficient algorithm described in Section 3.3. We use the full-sized model-zoo as our base model-zoo unless the size of the model-zoo used is explicitly stated otherwise. Detailed descriptions, e.g., training details and costs, are provided in Appendix D.2. For the baselines that provide searched architectures as well as weights pretrained on ImageNet-1K, such as FBNet, OFA, and MetaD2A (data-driven meta-NAS), we fine-tune the obtained networks on the meta-test query datasets for 50 epochs.

As shown in Table 1, we observe that TANS outperforms all baselines, with incomparably smaller search time and relatively smaller training time. Conventional NAS approaches such as PC-DARTS and DrNAS repeatedly require a large search time for every dataset, and thus are inefficient in this practical setting with real-world datasets. FBNet, OFA, and MetaD2A are much more efficient than general NAS methods since they search for subnets within a given supernet, but they obtain suboptimal performance on unseen real-world datasets, whose distributions may differ largely from the dataset the supernet is trained on. In contrast, our method achieves almost zero search time and reduced training time, as it fine-tunes a network pretrained on a relevant dataset. In Figure 5, we show the test performance curves and observe that TANS often starts from a higher initial accuracy and converges faster to a higher final accuracy. In Figure 4, we show example images from the query datasets and the training datasets that the retrieved models are pretrained on. In most cases, our method matches semantically similar datasets to the query dataset. Even in the semantically dissimilar cases (right column), for which our model-zoo does not contain models pretrained on datasets similar to the query, our models still outperform all other base NAS models. As such, our model effectively retrieves not only task-relevant models, but also potentially best-fitted models trained on dissimilar datasets, for the given query datasets. We provide detailed descriptions of all query-retrieval pairs in Figure 9 of the Appendix. We also compare with commercially available AutoML platforms, such as Microsoft Azure Custom Vision [1] and Google Vision Edge [2].
For this experiment, we evaluate on five randomly chosen datasets (out of ten), due to the excessive training costs required by the AutoML platforms. As shown in Figure 2, our method outperforms all commercial NAS/AutoML methods with a significantly smaller total time cost. We provide more details and additional experiments, such as including real-world architectures, in Appendix E.

To verify that our method is effective in retrieving networks with both an optimal architecture and relevant parameters, we conduct several ablation studies. We first report the results of base models that only search for the optimal architectures. Then we provide the results of the network retrieved using a variant of our method which does not use the topology (architecture) embedding and only uses the functional embedding v^τ_f (TANS w/o Topol.). As shown in Figure 6 (d), TANS w/o Topol. outperforms base NAS methods (except for MetaD2A) without considering the architectures, which shows that the performance improvement mostly comes from the knowledge transferred from the weights of the most relevant pretrained networks. However, the full TANS obtains about 1% higher performance over TANS w/o Topol., which shows the importance of the architecture and the effectiveness of our architecture embedding. In Figure 6 (e), we further experiment with versions of our method that initialize the retrieved networks with random weights and with ImageNet-pretrained weights, using a 1/20-sized model-zoo (700 models). We observe that both achieve lower accuracy than TANS on the 10 datasets, which again shows the importance of retrieving the knowledge of relevant tasks. We also construct a model-zoo by training the architectures found by an existing NAS method (MetaD2A), and find that it further improves the performance of TANS.

Constraints-conditioned Retrieval TANS can retrieve models given a dataset and additional constraints, such as the number of parameters or the amount of computation (in FLOPs). This is practically important since we may need a network with less memory and computation overhead depending on the hardware device. This can be done by filtering the retrieved candidate networks for those that satisfy the given conditions. For this experiment, we compare against OFA, which performs the same constrained search, as the other baselines do not straightforwardly handle this scenario. As shown in Figure 6 (a) and (b), we observe that the network retrieved with TANS consistently outperforms the network searched with OFA under varying parameter and computation constraints. Such constrained search is straightforward with our method, since our retrieval-based approach searches over a database consisting of networks with varying architectures and sizes.

Analysis of the Cross-Modal Retrieval Space We further examine the learned cross-modal space. We first visualize the meta-learned latent space in Figure 6 (c) with 1,400 models randomly sampled from the 14,000 models in the model-zoo. We observe that the networks whose embeddings are closest to the query dataset achieve higher performance on it, compared to the networks embedded farthest from it. For example, the accuracy of the closest network for UCF-AI is 98.94%, while the farthest network only achieves 91.53%. We also show Spearman correlation scores on 5 meta-test datasets in Table 2.
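A minimal sketch of how such a distance-accuracy correlation can be computed over a selected set of models is shown below; the cosine-distance choice and the function interface are assumptions, and how the 100 models are selected is discussed next.

```python
import numpy as np
from scipy.stats import spearmanr

def distance_rank_correlation(query_emb, model_embs, accuracies):
    """Spearman correlation between distance-to-query and model accuracy.

    query_emb: (d,) embedding of the query dataset
    model_embs: (n, d) embeddings of the selected models (e.g. 50 nearest + 50 farthest)
    accuracies: (n,) their measured accuracies on the query dataset
    A strongly negative correlation means that closer models tend to perform better.
    """
    q = query_emb / np.linalg.norm(query_emb)
    m = model_embs / np.linalg.norm(model_embs, axis=1, keepdims=True)
    dist = 1.0 - m @ q                      # cosine distance to the query
    rho, pval = spearmanr(dist, np.asarray(accuracies))
    return rho, pval
```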
Measuring correlation with the distances is not directly compatible with our contrastive loss, since the negative examples (models that achieve low performance on the target dataset) are pushed away from the query point without a meaningful ranking between the negative instances. To obtain a latent space where the negative examples are also well-ranked, we would have to replace the contrastive loss with a ranking loss, which would not be very meaningful. Hence, to effectively validate our latent space with a correlation metric, we instead select clusters: 50 models around the query point and another 50 models around the farthest point, using a total of 100 models to report the correlation scores. In Table 2, we show the correlation scores of these 100 models on the five unseen datasets. For the Food dataset (reported as a "hard" dataset in Table 1), the correlation scores are high. On the other hand, for the Colorectal Histology dataset (reported as an "easy" dataset), the correlation scores are low, as any model can obtain good performance on it, which makes the performance gap across models small. In sum, as the task (dataset) becomes more difficult, we observe a higher correlation between the distance and the rank in the latent space.

The role of the proposed performance predictor is not only to guide the model and query encoders to learn a better latent space, but also to select the best model among the retrieved candidates. To verify its effectiveness, we measure the performance gap between the top-1 model retrieved without the predictor and the model with the highest score selected by the predictor among the retrieved candidates. As shown in Figure 7 (a), there are 1.5%p to 8%p performance gains on the meta-test datasets. The top-1 model retrieved from the model zoo with TANS may not always be optimal for an unseen dataset, and our performance predictor remedies this issue by selecting the best-fitted model based on its estimation. We also conduct an ablation study on our performance predictor. Note that, since we do not use a ranking loss, the negative examples are not ranked, and we therefore report Mean Squared Error (MSE) scores. We retrieve the top 10 most relevant models for an unseen query dataset and then compute the MSE between the estimated scores and the actual ground-truth accuracies. As shown in Figure 7 (c), we observe that removing either the query or the model embeddings degrades performance compared to the predictor taking both embeddings. This is natural: with only the model or only the query information, it is difficult to correctly estimate the accuracy, since the predictor lacks either the task or the model context. Also, we report the MSE between the performance predicted by the predictor and the ground-truth performance of each model over the entire set of pretrained models from a smaller model zoo in Figure 7 (d). Although the performance predictor achieves slightly higher MSE scores in this experiment compared to the MSE obtained on the top-10 retrieved models (which are the most important), the MSE scores are still meaningfully low, which implies that our performance predictor works well even on the entire model-zoo. We also verify whether our model successfully retrieves the originally paired models when the corresponding meta-training datasets are given (we use unseen instances that were not used to train the encoders).
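A minimal sketch of this retrieval check, producing the recall@k and mean/median rank metrics defined next, could look as follows; the normalization and interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_embs, model_embs, ks=(1, 5, 10)):
    """Check whether each meta-training dataset retrieves its own paired model.

    query_embs: (N, d) embeddings computed from held-out instances of the N datasets
    model_embs: (N, d) embeddings of the N paired models (same row order)
    Returns recall@k percentages plus the mean and median rank of the true pair.
    """
    q = F.normalize(query_embs, dim=-1)
    m = F.normalize(model_embs, dim=-1)
    sim = q @ m.t()                                    # (N, N) cosine similarities
    order = sim.argsort(dim=1, descending=True)        # retrieval order per query
    target = torch.arange(len(q), device=q.device).unsqueeze(1)
    ranks = (order == target).float().argmax(dim=1) + 1  # rank of the paired model
    recalls = {k: (ranks <= k).float().mean().item() * 100 for k in ks}
    return recalls, ranks.float().mean().item(), ranks.float().median().item()
```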
For the evaluation metric, we use recall at k (R@k), which indicates the percentage of correct models retrieved among the top-k candidates for the unseen query instances, where k is set to 1, 5, and 10. We also report the mean and median rank of the correct network among all networks for the unseen query. In Figure 7 (b), the largest-parameter selection strategy shows poor performance on the retrieval problem, suggesting that simply selecting the largest network is not a suitable choice for real-world tasks. In addition, compared to cosine-similarity learning, the proposed meta-contrastive learning allows the model to learn a significantly more discriminative latent space for cross-modal retrieval. Moreover, without our performance predictor, TANS achieves slightly lower performance, while performance degrades significantly when training without the functional embeddings.

Analysis of Model-Zoo Construction Unlike most existing NAS methods, which repeatedly search for the optimal architecture per dataset from their search spaces, TANS does not need to perform such a repetitive search procedure once the model-zoo is built. We are able to adaptively retrieve the relevant pretrained models for any number of datasets from our model-zoo, with almost zero search cost. Formally, TANS reduces the time complexities of both the search cost and the pre-training cost from O(N) to O(1), where N is the number of datasets, as shown in Figure 8 (Top). Furthermore, a model zoo constructed using our efficient construction algorithm, introduced in Section 3.3, yields models with higher performance on average than the random sampling strategy for the same model-zoo size, as shown in Figure 8 (Bottom).

We propose a novel Task-Adaptive Neural Network Search framework (TANS) that instantly retrieves a relevant pre-trained network for an unseen dataset, based on the cross-modal retrieval of dataset-network pairs. We train this retrieval model via amortized meta-learning of the cross-modal latent space with contrastive learning, to maximize the similarity between the positive dataset-network pairs and minimize the similarity between the negative pairs. We train our framework on a model zoo consisting of diverse pretrained networks, and validate its performance on ten unseen datasets. The results show that the proposed TANS rapidly searches and trains a well-fitted network for unseen datasets with almost no architecture search cost and significantly fewer fine-tuning steps to reach the target performance, compared to other NAS methods. We discuss the limitations and the societal impact of our work in Appendix F.

are 512 (except for the last classification layer), rather than using raw images, simply to reduce computation costs. We then use a linear layer with 512 dimensions, followed by mean pooling and L2 normalization, which outputs encoded vectors with 128 dimensions. As we use Deep Sets [69], we also tried sum pooling instead of mean pooling; however, we observe that averaging over instances gives better R@k scores for correct-pair retrieval, and thus we use mean pooling when encoding query samples. Our model encoder takes both the OFA flat topology [8] and the functional embedding as inputs. For the flat topology, it contains information such as kernel size, width expansion ratio, and depth, in a 45-dimensional vector.
In addition, the functional embedding, which bypasses the need for direct parameter encoding, represents the model's learned knowledge as a 1536-dimensional vector. We first concatenate both vectors and normalize the result. Then, we learn a projection layer, a fully-connected layer with 1581 input dimensions, followed by an L2 normalization, which outputs encoded vectors with 128 dimensions. Our performance predictor takes both the query and model embeddings simultaneously. Both embedding vectors are 128-dimensional. We first concatenate the embeddings into a 256-dimensional vector and then forward it through a fully-connected layer with 256 input dimensions. We then produce a single continuous value as the predicted accuracy. We apply a sigmoid to map the value into the range from 0.0 to 1.0.

There are two steps of training required for our Task-Adaptive Network Search (TANS): 1) training the cross-modal latent space and 2) fine-tuning the retrieved model on an unseen meta-test dataset. For the model-zoo encoding, we set the batch size to 140, as we have 140 different datasets: for each dataset, we randomly choose one model among its 100 models, and then minimize the contrastive loss on the 140 samples. Although we train our encoders over a large number of dataset-network pairs (14,000 models), the entire training takes less than two hours on an NVIDIA RTX 2080 Ti GPU. We initialize our model weights with the same values across all encoders and experiments, rather than initializing the encoders differently for every experimental trial. We use the Adam optimizer with a learning rate of 1e-2. For the fine-tuning phase, we use exactly the same settings, such as hyper-parameters, learning rate, and optimizer, across all baseline models and our method; any differences are clearly mentioned in this section. We use the SGD optimizer with an initial learning rate of 1e-2, weight decay of 4e-5, and momentum of 0.9. We also use a cosine annealing learning rate scheduler. We train the models with 224×224 images (after resizing) and set the batch size to 32, except for PC-DARTS, which has memory issues with 224×224 images (we set its batch size to 12), and DrNAS, which we train with 32×32 images due to its heavy training cost. We train all models for 50 epochs and show that our model converges faster than all baseline models. For the model-zoo consisting of 14,000 random pairs used in the main experiment, we fine-tune the ImageNet-1K-pretrained OFA models on each dataset for 625 epochs, following the progressive shrinking method described in [8]. We then choose 100 random OFA architectures for each dataset and evaluate their test accuracies on the test split. For the efficiently constructed model-zoo experiment, we use the algorithm described in Section 3.3 and further elaborated in Section A.1, using the 14,000-pair model zoo as the search space. For the initial samples, we use N_init = 750, where 5-6 samples are taken from each dataset. The accuracy predictor is retrained from scratch every 64 iterations until the validation accuracy no longer improves for 5 epochs.

In this section, we show that the proposed query and model encoders represent injective functions of the input query D ∈ Q and model M ∈ M, respectively. Proposition 1 (Injectiveness of Query Encoding). Assume Q and D are finite sets.
A query encoder E_Q : Q → R^d can injectively map two different queries D_1, D_2 into distinct embeddings q_1, q_2, where D ∈ Q and q ∈ R^d.

Proof. A query encoder E_Q maps a query dataset D ∈ Q to a vector q ∈ R^d as follows: E_Q : D → q, where Q is a set of queries, each of which contains a set of data instances X constructing a dataset D = {X_1, X_2, ..., X_n}. Our goal here is to construct a query encoder that uniquely maps two different queries D_1, D_2 into two distinct embeddings q_1, q_2. Each dataset D consists of n data instances, D = {X_1, X_2, ..., X_n}, where n is finite. To encode each query dataset D into a vector space, as described in the Query Encoder paragraph of Section 3.2, we first transform each instance X_i into a representation space with a continuous function ρ, and then aggregate all set elements, adopting Zaheer et al. [69]. In other words, the query encoder can be defined as follows: q = Σ_{X_i ∈ D} ρ(X_i). We assume that Q is a finite set, and each D = {X_1, X_2, ..., X_n} ∈ Q is also a finite set with |D| = n elements. Therefore, the set of data instances X is countable, since the product of two natural numbers (i.e., |Q| × n) is a natural number. For this reason, there exists a unique mapping Z from each element X to a natural number in N. If we let ρ(X) = 4^{−Z(X)}, then the form of the query encoder Σ_{X_i ∈ D} ρ(X_i) constitutes a unique mapping for every set D ∈ Q (see Zaheer et al. [69] and Wagstaff et al. [57] for details). In other words, the output of the query encoder is unique for each input dataset D that consists of n data instances. Thanks to the universal approximation theorem [25, 24], we can construct such a mapping function ρ using multi-layer perceptrons (MLPs).

We first show that the topological encoding function E_{M_T} : M → v_t can uniquely represent each architecture M in the embedding space. As described in the Model Encoder part of Section A, we use a 45-dimensional vector that contains topological information, such as the number of layers, channel expansion ratios, and kernel sizes (see Cai et al. [8] for details), for the topological encoding. Each such topological description uniquely defines a neural architecture. Therefore, the embedding v_t from the topological encoding function E_{M_T} is unique for each neural network M. While we can obtain a distinct embedding of each neural network with the topological encoding function alone, we also consider the injectiveness of the functional encoding in the following.

To consider the functional embedding, we first model a neural architecture as its computational graph, which can further be denoted as a directed acyclic graph (DAG). Using this computational graph scheme, a functional model encoder E_{M_F} maps an architecture (computational graph) M ∈ M into a vector v_f as follows: E_{M_F} : M → v_f. Our goal here is then to make the functional encoder E_{M_F} uniquely map two different neural architectures M_1, M_2 into two different embeddings v_{f,1}, v_{f,2}, with the computational graph represented as a DAG structure. Assume that the computational graph for a neural network M has n nodes. Then, each node v_i on the graph has its corresponding operation o_i, which transforms the incoming features for the node v_i into an output representation C_i. In other words, C_i indicates the output of the composition of all operations along the path from v_1 to v_i.
In our model encoder, the arbitrary input signal x is the fixed Gaussian noise, which we feed into the starting node v_1 (see the Model Encoder paragraph of Section 3.2 for details). For simplicity of notation, we also set C_0 = x, which is the output of a virtual node v_0 and the incoming representation of the starting node of the computational graph. Then, the output representation for the node v_i is formally defined as follows: C_i(x) = o_i({C_j(x) : v_j → v_i}), where {C_j(x) : v_j → v_i} denotes the multiset of output representations of v_i's predecessors, and the operation o_i transforms the incoming representations over the multiset into the output representation. Note that, to account for the multiplicity of nodes on a graph, we use a multiset rather than a set [64]. Then, the computational graph for the network M with the fixed Gaussian input noise x is uniquely represented by the functional encoder E_{M_F} : M → v_f, where v_f = C_n for a graph with n nodes. Note that we use a network M that is task-adaptively trained on a specific target dataset, not only to obtain high performance on the target dataset but also to reduce the fine-tuning cost on it. Thus, while we might further need to consider the parameters on the computational graph, we show the injectiveness of the functional encoding only with the computational graph structure and leave the consideration of parameters as future work, since it is complicated to formally define injectiveness with trainable parameters. To sum up, we show the injectiveness of the model representation with both the topological and functional encoding schemes, although either encoding function alone can injectively represent the neural network. While we further concatenate and transform the two output representations with a function g to obtain the final model representation, m = g([v_t, v_f]), the representation m is also unique for each neural network M when g is an injective function. Similar to the universal approximation theorem [25, 24], we can construct injective mapping functions g and ω with learnable parameters.

Before constructing a model zoo that contains a large number of dataset-architecture pairs, we first need to define an architecture search space, to handle all architectures in a consistent manner. To easily obtain task-adaptive parameters for a given task while considering various factors, such as the number of layers, kernel sizes, and width expansion ratios, we use the supernet-based OFA architecture space [8], which is the same as the well-known MobileNetV3 space [26]. Each neural architecture in the search space consists of a stack of 20 mobile inverted bottleneck convolution (MBConv) blocks, where the number of units is 5 and the number of layers in each unit ranges over {2, 3, 4}. Moreover, for each layer, we select the kernel size from {3, 5, 7} and the width expansion ratio from {3, 4, 6}. This strategy allows us to generate around 10^19 neural architecture candidates in theory. To construct a model zoo consisting of a large number of dataset-architecture pairs, we collect 89 real-world image classification datasets from Kaggle and obtain 100 random architectures per dataset from the OFA space. Specifically, we first divide the collected datasets into two non-overlapping sets for meta-training and meta-testing. If a dataset has more than 20 classes, we randomly split it into multiple datasets such that each consists of up to 20 classes.
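A minimal sketch of this class-wise splitting and the train/validation sampling is shown below; it assumes each dataset is given as a simple list of (image path, label) pairs, which is an illustrative simplification of the actual preprocessing pipeline.

```python
import random
from collections import defaultdict

def split_dataset(samples, max_classes=20, val_ratio=0.2, seed=0):
    """Split one labeled dataset into chunks of at most `max_classes` classes,
    then carve a validation split from each chunk.

    samples: list of (image_path, label) pairs for the original dataset.
    Returns a list of derived datasets, each a dict with "train" and "val" lists.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))

    labels = list(by_class)
    rng.shuffle(labels)
    chunks = [labels[i:i + max_classes] for i in range(0, len(labels), max_classes)]

    datasets = []
    for chunk in chunks:                      # one derived dataset per label chunk
        train, val = [], []
        for label in chunk:
            items = by_class[label][:]
            rng.shuffle(items)
            n_val = int(len(items) * val_ratio)
            val.extend(items[:n_val])
            train.extend(items[n_val:])
        datasets.append({"train": train, "val": val})
    return datasets
```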
For meta-testing, we randomly selected only one of the splits of each original dataset, for diversity. This process yields 140 datasets for meta-training and 10 datasets for meta-testing. To generate a validation set for each dataset, we randomly sample 20% of the data instances from each dataset and use them for validation, while using the remaining 80% as training instances. As for statistics, the number of classes ranges from 2 to 20 with a median of 16, and the number of instances per dataset ranges from 8 to 158K with a mean of 2,847. We then construct the model-zoo by fine-tuning 100 random OFA architectures on the training instances of each dataset and obtaining their performance on the respective validation instances, which yields 14K (dataset, architecture, accuracy) tuples in total. We use this database throughout this paper. See Table 5 for detailed information on each dataset.

Here we describe the baselines we use in the experiments in the main document. We compare the performance of the models retrieved with our method against pretrained neural networks as well as those searched by several efficient NAS methods that are closely related to ours: 1) MobileNetV3 [26], a representative resource-efficient neural architecture tuned for mobile phone environments. In our experiments, MobileNetV3 is pretrained on ImageNet-1K and fine-tuned for 50 epochs on each meta-testing task. 2) PC-DARTS [65], a differentiable NAS method based on a weight-sharing scheme that reduces the search time efficiently and especially improves memory usage, search time, and performance compared to DARTS [35] by introducing partial channel sampling and edge normalization. We search for architectures for each meta-testing task by following the official code at https://github.com/yuhuixu1993/PC-DARTS. 3) DrNAS [10], a differentiable NAS method that treats NAS as a distribution learning problem modeled by a Dirichlet distribution. We use the official code at https://github.com/xiangning-chen/DrNAS. 4) Once-for-All (OFA) [8], a NAS method that provides a subnet sampled from a larger network (supernet) pretrained on ImageNet-1K, which alleviates the performance degeneration of prior supernet-based methods. We use the code at https://github.com/mit-han-lab/once-for-all. 5) MetaD2A [31], a meta-NAS model that rapidly generates a data-dependent architecture for a given task and is meta-learned on subsets of ImageNet-1K. From the ImageNet-1K dataset and the architectures of the OFA search space, we randomly use 3,296 and 14,000 meta-training tasks for the generator and the predictor, respectively, as the source database. 6) FBNet [61], a collection of convolutional models obtained via differentiable neural architecture search. We use FBNet-A pretrained on ImageNet-1K and fine-tune it on each meta-testing task for 50 epochs. We use the same hyper-parameters for all baselines for a fair comparison. We fine-tune each architecture for 50 epochs on each meta-testing task. The SGD optimizer is used with a learning rate of 0.01, momentum of 0.9, and weight decay of 4e-5. The image size is 224×224 and the batch size is 32.

, and MetaD2A (about 0.5%p to 1.0%p higher). We observe that collecting more lightweight real-world neural network and dataset pairs (TANS w/ Real-world Model-Zoo) will allow our model to retrieve computationally efficient pretrained networks in a task-adaptive manner.
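To illustrate how such additional real-world networks could be folded into the zoo, here is a hypothetical registration helper; the encoder interface follows the earlier model-encoder sketch and is an assumption, not the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def register_model(zoo_index, net, topo_vec, model_encoder, meta=None):
    """Add one pretrained network to the retrieval index.

    zoo_index: list of dicts {"embedding": tensor, "meta": ...} used at retrieval time
    net, topo_vec: the trained network and its topology vector
    model_encoder: a meta-trained model encoder E_M (assumed interface)
    Because retrieval only compares pre-computed embeddings, extending the zoo
    needs a single forward pass per new model and no re-training of the retriever.
    """
    emb = F.normalize(model_encoder(net, topo_vec), dim=-1).squeeze(0)
    zoo_index.append({"embedding": emb, "meta": meta or {}})
    return zoo_index
```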
This data-driven nature is another advantage of our method, since we can easily increase the performance of the model by collecting more pretrained networks that are readily available in many public databases. In the experiment introduced in the main document (Table 1), we train DrNAS and PC-DARTS, which only generate architectures without pretrained weights, for 10 times more iterations (500 epochs) for a fair comparison (while the other methods, which share ImageNet-pretrained knowledge, are trained for 50 epochs). In this experiment, rather than training for 500 epochs, we pretrain the networks obtained by DrNAS and PC-DARTS on ImageNet and then fine-tune them on two meta-test datasets (Colorectal Histology and Food Classification). As shown in Table 4, although pretraining on ImageNet improves their results, our method, including TANS with a 1/10-sized model-zoo (1,400 models), still outperforms all baselines, which shows that retrieving and utilizing pretrained weights from relevant tasks is more effective than using ImageNet-pretrained weights.

Not only real-world architectures but also any existing NAS method can be integrated with our retrieval framework by simply adding the searched networks to our model-zoo. We demonstrate such a synergistic effect of TANS and NAS methods in Figure 6 (e) of the main document. Constructing the model-zoo with neural architectures generated by MetaD2A, a state-of-the-art NAS method, improves our performance compared to the previous model-zoo that is simply sampled from the OFA search space. Considering that NAS approaches have been actively studied [31, 52, 7, 51, 49, 15] and pretrained models are often shared as open source, we believe that the TANS framework has strong potential to continuously improve its performance by absorbing such new models into the model-zoo.

Our framework, TANS, has the following beneficial societal impacts: (1) enhanced accessibility, (2) preservation of data privacy, and (3) reduction of reproduction effort.

Enhanced accessibility Since our Task-Adaptive Neural Network Search (TANS) framework allows anyone to instantly retrieve a full neural network that works well on a given task by providing only a small set of data samples, it can greatly enhance the accessibility of AI for users with little background knowledge. Moreover, it does not require large computational resources, unlike existing NAS or AutoML frameworks, which further helps its accessibility. Finally, to allow everyone to benefit from our task-adaptive neural network search framework, we will publicly release our model-zoo, which currently contains more than 15K models, and open-source it. Then, anyone will be able to freely retrieve or update models from our model-zoo.

Preservation of data privacy Our framework requires only a small set of sampled data instances to retrieve the task-adaptive neural network, unlike existing NAS/AutoML methods that require a large number of data instances to search for optimal architectures on the target datasets. Thus, data privacy is largely improved, and we can further allow the set encoding to take place on the client side rather than at the server. This results in enhanced data privacy, as none of the raw data samples need to be submitted to the system.

Reduction of reproduction effort Many ML researchers and engineers waste their time and labor, as well as computational and monetary resources, on reproducing existing models and fine-tuning them.
Since TANS instantly retrieves a task-relevant model from a model zoo that contains a large number of state-of-the-art networks pretrained on diverse real-world datasets, users need not redesign networks or retrain them at excessive cost. Since we plan to populate the model zoo with more pretrained networks, the coverage of datasets and architectures will become even broader over time. Since training deep learning models often requires an extremely large amount of compute, which is costly in terms of energy consumption and results in high carbon emissions, our method is also environmentally friendly.

As a prerequisite, our method must have a model-zoo containing pretrained models that cover diverse tasks and perform well on each given task. There is a chance that TANS could be affected by biased initialization if the meta-training pool contains biased pretrained models. To prevent this issue, we can use existing techniques that ensure fairness when constructing a model-zoo, which identify and discard inappropriate datasets or models. There have been various studies on alleviating unjustified bias in machine learning systems. Fairness can be classified into individual fairness, treating similar users similarly [16, 68], and group fairness, measuring the statistical parity between subgroups such as race or gender [70, 36, 22]. Optimizing fairness metrics during training is achieved by regularizing the covariance between sensitive attributes and model predictions [59] and by minimizing an adversary's ability to estimate sensitive attributes from model predictions [71]. At evaluation time, [3, 12] improve the generalizability of a fair classifier via two-player games. All these methods can be adopted when building our model-zoo.

All datasets that we utilize are described in Table 5 (due to the space limit, we provide hyperlinks to the webpages for the datasets, rather than printing the full website links).

Microsoft azure custom vision A reductions approach to fair classification Lambdanetworks: Modeling long-range interactions without attention A parallel global multiobjective framework for optimization: pagmo SMASH: one-shot model architecture search through hypernetworks High-performance large-scale image recognition without normalization Once for all: Train one network and specialize it for efficient deployment Pre-training tasks for embedding-based large-scale retrieval DrNAS: Dirichlet neural architecture search RENAS: reinforced evolutionary neural architecture search Training well-generalizing classifiers for fairness metrics and other data-dependent constraints Imagenet: A large-scale hierarchical image database An efficient framework for secure image archival and retrieval system using multiple secret share creation scheme An image is worth 16x16 words: Transformers for image recognition at scale Fairness through awareness Meta-learning of neural architectures for few-shot learning Finding beans in burgers: Deep semantic-visual embedding with localization VSE++: improving visual-semantic embeddings with hard negatives Model-agnostic meta-learning for fast adaptation of deep networks End-to-end learning of deep visual representations for image retrieval Equality of opportunity in supervised learning Deep residual learning for image recognition Approximation capabilities of multilayer feedforward networks Multilayer feedforward networks are universal approximators Ruoming Pang, Vijay Vasudevan, et al.
Searching for mobilenetv3 Mobilenets: Efficient convolutional neural networks for mobile vision applications Squeezenet: Alexnet-level accuracy with 50x fewer parameters and< 0.5 mb model size Imagenet classification with deep convolutional neural networks Learning to balance: Bayesian meta-learning for imbalanced and out-of-distribution tasks Rapid neural architecture search by learning to generate graphs from datasets Stacked cross attention for image-text matching Meta-learning with differentiable convex optimization Towards fast adaptation of neural architectures with meta learning Darts: Differentiable architecture search The variational fair autoencoder Nsganetv2: Evolutionary multi-objective surrogate-assisted neural architecture search Neural architecture optimization Shufflenet v2: Practical guidelines for efficient cnn architecture design On first-order meta-learning algorithms Empirical Performance of the Approximation of the Least Hypervolume Contributor Efficient neural architecture search via parameter sharing Regularized evolution for image classifier architecture search A comprehensive survey of neural architecture search: Challenges and solutions Mobilenetv2: Inverted residuals and linear bottlenecks Meta architecture search Very deep convolutional networks for large-scale image recognition Prototypical networks for few-shot learning Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition Going deeper with convolutions Rethinking model scaling for convolutional neural networks Efficientnetv2: Smaller models and faster training Mnasnet: Platform-aware neural architecture search for mobile A semi-supervised assessor of neural architectures Learning to Learn Matching networks for one shot learning On the limitations of representing functions on sets Cross-modal scene graph matching for relationship-aware image-text retrieval Learning nondiscriminatory predictors Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search Aggregated residual transformations for deep neural networks Approximate nearest neighbor negative contrastive learning for dense text retrieval How powerful are graph neural networks? Pc-darts: Partial channel connections for memory-efficient architecture search Deep multi-view enhancement hashing for image retrieval Does unsupervised architecture representation learning help neural architecture search? 
Training individually fair ml models with sensitive subspace robustness Deep sets Learning fair representations Mitigating unwanted biases with adversarial learning D-vae: A variational autoencoder for directed acyclic graphs Deep reinforcement learning for information retrieval: Fundamentals and advances Deep supervised cross-modal retrieval Econas: Finding proxies for economical neural architecture search Neural architecture search with reinforcement learning Learning transferable architectures for scalable image recognition

Organization In the Appendix, we provide detailed descriptions of the materials that are not fully covered in the main paper, and provide additional experimental results, organized as follows:
• Section A - We describe the implementation details of our model-zoo construction, query and model encoders, and meta-surrogate performance predictor.
• Section B - We provide the details of model training, such as the learning rate and hyper-parameters, for meta-training/testing and for constructing the model-zoo.
• Section C - We provide the proof of injectiveness of the proposed query and model encoding functions over the cross-modal latent space.
• Section D - We elaborate on the detailed experimental setups, such as the architecture space, baselines, and datasets, corresponding to the experiments introduced in the main document.
• Section E - We provide additional analysis of the experiments introduced in the main document and present experiments with different model-zoo settings.
• Section F - We discuss the societal impact and the limitations of our work.

A.1 Efficient Model Zoo Construction
The algorithm used to efficiently construct the model-zoo is described in Algorithm 1.

Algorithm 1: Efficient Model-Zoo Construction
    Input: D, M: collections of datasets and models, respectively;
           Z^(0) ⊆ D × M × [0, 1]: set of N_init initial tuples of (dataset, model, test accuracy)
    t ← 0
    while termination condition is not met do
        if t is divisible by N_train then
            Train accuracy predictor parameters ψ_zoo on data Z^(t)
        (D, M) ← candidate pair with the largest estimated Pareto-front expansion f_zoo({D, M}; Z^(t))
        α* ← Evaluate the actual accuracy of (D, M) by training M on D
        Z^(t+1) ← Z^(t) ∪ {(D, M, α*)};  t ← t + 1

The expansion criterion is estimated as the expected improvement of the hypervolume under the accuracy predictor,

    f_zoo({D, M}; Z^(t)) = E_{S(·;ψ_zoo)} [ g_D(Z^(t) ∪ {(D, M, S(D, M; ψ_zoo))}) − g_D(Z^(t)) ],    (8)

where S indicates the accuracy predictor, and g_D is the normalized volume under the Pareto-dominated pairs for dataset D, defined over the test accuracy, s̃_latency(M), and s̃_parameters(M), which indicate the normalized latency and the normalized number of parameters of the model M, respectively. The latency and parameters are normalized so that the maximum value across all models becomes 1.0 and the minimum value becomes 0.0. The hypervolume can be computed efficiently with the PyGMO library [5].

The accuracy predictor used in the model-zoo construction is very similar in structure to the performance predictor described in Section A.4, but we use a functional embedding obtained from the model pretrained on ImageNet-1K, instead of a functional embedding obtained from a model already trained on the target dataset, since training the model on the target dataset just to obtain the functional embedding would defeat the purpose of this algorithm. Also, to incorporate uncertainty about the accuracy predictions, we use 10 samples from the accuracy predictor with MC dropout to evaluate the expectation in (8). The dropout probability is set to 0.5.

Our query encoder takes sampled instances (e.g., 10 unseen random images per class) from the query dataset as input. We use image embeddings from ResNet18 [23] pretrained on ImageNet-1K [13], whose dimensions