key: cord-0605507-z6birpdi authors: Wang, Jindong; Feng, Wenjie; Liu, Chang; Yu, Chaohui; Du, Mingxuan; Xu, Renjun; Qin, Tao; Liu, Tie-Yan title: Learning Invariant Representations across Domains and Tasks date: 2021-03-03 journal: nan DOI: nan sha: ab2a770ddcd58f0bf9a5056c287330adcdfe8c1f doc_id: 605507 cord_uid: z6birpdi

Because it is expensive and time-consuming to collect massive COVID-19 image samples for training deep classification models, transfer learning is a promising approach that transfers knowledge from abundant typical pneumonia datasets to COVID-19 image classification. However, negative transfer may deteriorate performance due to the feature distribution divergence between the two datasets and the task semantic difference between diagnosing pneumonia and diagnosing COVID-19, which rely on different characteristics. The problem is even more challenging when the target dataset has no labels available, i.e., unsupervised task transfer learning. In this paper, we propose a novel Task Adaptation Network (TAN) to solve this unsupervised task transfer problem. In addition to learning transferable features via domain-adversarial training, we propose a novel task semantic adaptor that uses the learning-to-learn strategy to adapt the task semantics. Experiments on three public COVID-19 datasets demonstrate that our proposed method achieves superior performance. In particular, on the COVID-DA dataset, TAN significantly increases the recall and F1 score by 5.0% and 7.8% compared to recent strong baselines. Moreover, we show that TAN also achieves superior performance on several public domain adaptation benchmarks.

The COVID-19 pandemic is greatly threatening global public health. In the battle against COVID-19, one critical challenge is to diagnose patients among a large number of people and provide necessary medical treatment so as to prevent further spread of the virus. There is a growing trend to use the screening of chest radiography images (CRIs), such as X-ray images [42], for automated computer-aided diagnosis. The diagnosis of COVID-19 based on chest X-ray images is a standard image classification problem with two classes: infected and uninfected.

Figure 1(b). Our method gives correct predictions by adapting both feature distributions and task semantics. The attention map shows that our method can capture the critical factors [42] in the image that help detection.

While deep neural networks (DNNs) have achieved great success in image classification, they often require a large amount of labeled images for training. Unfortunately, for COVID-19, large-scale annotations are costly and time-consuming to collect. Therefore, a straightforward approach is to leverage transfer learning (TL) techniques [23, 44, 25, 45] to transfer knowledge from existing (abundant) typical pneumonia datasets (i.e., the source domain) to COVID-19 (i.e., the target domain) to facilitate model learning. In this paper, we focus on the most challenging transfer setting, where (1) the target domain has no labels and (2) the labels in the source and target domains have different semantic meanings. We call this setting unsupervised task transfer. In this unsupervised setting, the standard pretrain-finetune transfer paradigm becomes inapplicable, as there are no labeled images available in the target domain for finetuning. This requires us to conduct unsupervised adaptation between two different tasks, i.e., train a model on the labeled source domain and adapt it to the target domain in an unsupervised manner.
Unsupervised adaptation presents two critical challenges that can result in negative transfer [25] and produce even worse performance than no transfer at all. The first challenge is feature distribution divergence, which naturally exists since the feature distribution differs between the source and the target domain. Hence, feature distribution adaptation is necessary. The second challenge is task semantic difference, since diagnosing pneumonia and diagnosing COVID-19 are two related but different tasks that place different preferences on the critical factors [42]. Thus, the task semantics should also be adapted to maximize the transfer performance. While existing domain adaptation (DA) methods are able to adapt feature distributions when the source and target tasks are identical (e.g., both domains classify monitors under different backgrounds), they are not applicable to our problem [7, 50, 18, 4, 37].

In this paper, we propose a Task Adaptation Network (TAN) for this unsupervised task transfer problem. The concept of our method is illustrated in Figure 1(a). TAN is able to learn transferable features across domains and tasks. Concretely, TAN first adopts domain-adversarial training to reduce the feature distribution divergence between domains. However, adapting the features alone is not sufficient due to the task semantic difference. TAN therefore devises a novel permutation-invariant task semantic adaptor that uses the learning-to-learn strategy to handle the task semantic difference. We design a feature-critic training algorithm that effectively adapts task semantics using the pivot data. Figure 1(b) gives an example of the activation map of TAN to show its effectiveness in finding the critical factors [42] by adapting both the feature distributions and the task semantics.

To sum up, this paper makes the following contributions:
1. We propose a novel Task Adaptation Network (TAN) for unsupervised task transfer that addresses both the feature distribution divergence and the task semantic difference. Especially for the challenging task semantic adaptation, we propose a novel task semantic adaptor that leverages the learning-to-learn strategy to adapt cross-domain tasks.
2. Experiments on three public COVID-19 chest X-ray image classification datasets demonstrate that TAN outperforms several state-of-the-art baselines. Specifically, on the challenging COVID-DA dataset, TAN significantly improves the F1 score and recall by 7.8% and 5.0%, respectively, compared to the second best baseline.
3. TAN is a general and flexible method that also achieves superior performance on several public domain adaptation benchmarks, including ImageCLEF-DA, Office-Home, and VisDA-2017.

Transfer learning (TL) [25] is a useful technique for transferring knowledge from existing source domains to a target domain, especially when the target domain has sparse or no labels. Such label scarcity can be alleviated by first pretraining on a large dataset such as ImageNet [8] and then finetuning the pretrained model on downstream tasks. This strategy is widely used in modern computer vision research [23, 36, 45, 44, 9]. In a semi-supervised setting where the target domain has labels, Luo et al. [22] proposed a domain and task transfer network that handles different tasks via task semantic transfer. However, when the target domain has no labels, the pretrain-finetune paradigm is unavailable.
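As a concrete illustration of this pretrain-finetune paradigm, the following is a minimal PyTorch sketch; the ResNet-18 backbone matches the experimental setup described later, while the optimizer settings and two-class head are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrain-finetune baseline: start from an ImageNet-pretrained backbone,
# replace the classification head, and finetune on the labeled source task.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)  # e.g., normal vs. pneumonia

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def finetune_step(x, y):
    """One supervised finetuning step on a labeled mini-batch (x, y)."""
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```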
When two tasks are related, multi-task learning (MTL) [5, 17, 33] can be used to learn transferable features that enhance the performance of both. MTL, however, requires that the different tasks all have labels available. Meta-learning, or learning-to-learn [2, 10, 31], aims to learn general knowledge from a set of tasks and then transfer it to unseen tasks. Meta-learning typically works in the few-shot setting, where each task has a few labels available, and it does not explicitly reduce the feature distribution divergence between domains and tasks. Zero-shot learning (ZSL) [24, 34] focuses on classifying entirely unseen classes, which can be seen as a generalization of our problem, in which there is one unseen category. In contrast to our approach, ZSL does not reduce the distribution divergence across domains. Domain adaptation (DA) is a specific area of transfer learning [25]. DA aims to build cross-domain models by reducing the distribution divergence of representations via divergence measures such as Maximum Mean Discrepancy (MMD) [14], KL or JS divergence, cosine similarity, and higher-order moments [46, 18, 38, 35, 49, 41]. Another line of work relies on generative adversarial nets [13] to learn domain-invariant features [50, 29, 37, 11]. While great progress has been made, directly applying DA to our problem is insufficient, since DA generally assumes that the two domains share identical categories. The setting where two domains have different but overlapping tasks is also explored in recent open set DA [26, 30] and partial DA [47] methods. However, their purpose is to recognize the overlapping (common) categories rather than the unshared classes in the target domain.

We introduce an unsupervised learning model that transfers information from a large labeled source domain S to a target domain T across different tasks.

Figure 2. The architecture of the proposed TAN method, which consists of four modules: feature extractor F_φ, classifier C_ψ, domain discriminator D_ω for feature distribution adaptation, and task semantic adaptor M_θ for task semantic adaptation.

The goal is to learn a strong transferable target classifier h : X_T → Y_T while reducing both the feature distribution divergence and the task semantic difference. We assume the source domain contains n_s images x_s ∈ X_S with associated labels y_s ∈ Y_S. The target domain consists of n_t unlabeled images x_t ∈ X_T. S and T share the same feature space, i.e., X_S = X_T. Unlike traditional domain adaptation approaches that assume a cross-domain distribution shift under a shared label space (Y_S = Y_T), we aim to adapt both the feature distribution and the task semantics, i.e., we consider the case where the tasks corresponding to the source and target spaces are similar but not identical, Y_S ≠ Y_T, as with pneumonia and COVID-19. Even if both classification problems can be cast as classifying images into {0, 1}, their semantics are still different.¹

In this paper, we propose a novel Task Adaptation Network (TAN) to adapt the feature distribution and task semantics. We depict the overall model in Figure 2. TAN consists of four modules: feature extractor F_φ, classifier C_ψ, domain discriminator D_ω, and task semantic adaptor M_θ, with φ, ψ, ω, and θ as the learnable parameters. Taking the labeled source examples as inputs, TAN learns a latent feature space with F_φ and the binary classification network C_ψ by standard supervised learning.
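To make the four modules concrete, here is a minimal PyTorch sketch of one possible instantiation. The ResNet-18 backbone and the in−128−64−1 adaptor follow the experimental section of the paper; the discriminator width and other details are our assumptions:

```python
import torch.nn as nn
from torchvision import models

class FeatureExtractor(nn.Module):  # F_phi
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(pretrained=True)
        self.body = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.out_dim = backbone.fc.in_features  # 512 for ResNet-18

    def forward(self, x):
        return self.body(x).flatten(1)  # (N, 512) feature vectors

classifier = nn.Linear(512, 2)  # C_psi: binary task head

discriminator = nn.Sequential(  # D_omega: source-vs-target logit
    nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))

def make_task_adaptor(in_dim):
    """M_theta: the in-128-64-1 MLP applied to flattened pairwise distances."""
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                         nn.Linear(128, 64), nn.ReLU(),
                         nn.Linear(64, 1))
```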
Then, also taking the unlabeled target domain as input, TAN adapts the feature distributions and the task semantics with the domain discriminator D_ω and the task semantic adaptor M_θ, respectively. Our model jointly optimizes a source classification loss L_cls, a feature distribution adaptation loss L_feat, and a task semantic transfer objective L_task. The total objective of TAN can thus be formulated as:

L = L_cls + λ L_feat + μ L_task,    (1)

where the hyperparameters λ and μ determine the influence of the feature distribution adaptation and the task semantic adaptation, respectively. We use the cross-entropy loss (CE) to measure the classification error L_cls on the labeled source domain:

L_cls = E_{(x_s, y_s)} [ CE(C_ψ(F_φ(x_s)), y_s) ],    (2)

where E denotes the expectation.

¹ Although we can formulate both binary classification problems using Y = {0, 1}, the semantics of label "1", i.e., the "true" label, differ.

In the following sections, we elaborate on the feature distribution adaptation and task semantic transfer modules.

Considering the similarity between the tasks in S and T, the feature distribution adaptation module aims to reduce the cross-domain feature distribution divergence. Inspired by well-established domain-adversarial training [11, 12], TAN uses a domain discriminator D_ω to adapt the cross-domain feature distributions. Domain-adversarial training is a two-player game: the domain discriminator D_ω is trained to distinguish the source domain from the target domain, while the feature extractor F_φ tries to confuse the domain discriminator by learning domain-invariant features. These two players are trained adversarially, i.e., ω is trained to minimize the domain classification loss while φ is trained to maximize it. This optimization procedure eventually minimizes the difference between the feature distributions of the two domains, as measured by the Jensen-Shannon divergence [13]. The adversarial training loss for feature distribution adaptation can be formulated as:

L_feat = − E_{x_s ∈ X_S} [ log D_ω(F_φ(x_s)) ] − E_{x_t ∈ X_T} [ log(1 − D_ω(F_φ(x_t))) ].    (3)

Domain-adversarial training alone is insufficient for our problem, since it learns domain-invariant features regardless of the tasks. In our problem, even though the source and target tasks are similar enough to make domain-adversarial training reasonable, the tasks are still not identical. Adapting the feature distributions does not necessarily ensure adaptation across different task semantics. Since the features are highly correlated with the tasks, the difference in task semantics can harm the feature adaptation and the performance of knowledge transfer. To address this problem, we propose a task semantic adaptor M_θ that employs a learning-to-learn strategy to adapt task semantics by learning from the domain-adversarial features. Learning-to-learn, or meta-learning [2, 10, 31], aims to effectively leverage the datasets and prior knowledge of an ensemble of tasks in order to rapidly learn new tasks, often from a small amount of data. Since the task semantics are difficult to model explicitly, we turn to the learning-to-learn strategy. The key idea is to let M_θ learn the adaptation ability from domain-adversarial training, so that this ability can then be utilized for task adaptation. As domain-adversarial training gradually encourages the features to be domain-invariant, the task semantic adaptor M_θ gradually learns this ability via the learning-to-learn strategy and eventually also enforces the features to be task-invariant. Hence, the task semantics can be adapted.
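The adversarial objective in Eq. (3) is typically implemented with the Gradient Reversal Layer of [11], which the training section below also adopts. A minimal sketch, assuming a logit-output discriminator trained with binary cross-entropy:

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lam on the backward
    pass, so a single backward step trains omega and adversarially updates phi."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def feature_adaptation_loss(discriminator, feat_s, feat_t, lam=1.0):
    """Domain classification loss L_feat with reversed gradients to F_phi."""
    feats = GradReverse.apply(torch.cat([feat_s, feat_t]), lam)
    logits = discriminator(feats).squeeze(1)
    labels = torch.cat([torch.ones(len(feat_s), device=logits.device),
                        torch.zeros(len(feat_t), device=logits.device)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```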
Technically, the task semantic adaptor M_θ is implemented as an MLP, which can in theory approximate any continuous function [6], enabling flexible adaptation. Denoting by F_φ^s and F_φ^t the source- and target-domain features extracted by F_φ, the task semantic adaptation loss L_task can be formulated as:

L_task = M_θ(F_φ^s, F_φ^t).    (4)

Unfortunately, optimizing the above equation w.r.t. θ raises three questions. First, what property should M_θ satisfy to ensure that the task semantics can be adapted? Second, how can we maximally utilize the domain-invariant representations learned by adversarial training so that M_θ couples well with the feature extractor for more effective training? Third, how should the adaptor network parameters be updated during training?

Permutation invariance. The task semantic adaptation is supposed to reduce the difference between two tasks, so it should be invariant to permutations of the samples that represent each task distribution. Therefore, we design M_θ to be permutation-invariant to the rows of its inputs, i.e., different sample orderings such as [1, 2, 3] and [3, 2, 1] should make no difference. To enforce this property, we let M_θ take as input the pairwise distances between the elements of F_φ^s and F_φ^t, which are permutation-invariant [16]. The task semantic loss can then be represented as:

L_task = MLP(Flatten(Gram(F_φ^s, F_φ^t))),    (5)

where Gram denotes the Gram matrix computed from pairwise distances, Flatten is a flattening operation, and MLP denotes a multi-layer perceptron.

Pivot data. In a supervised setting where the target domain has labels, the task adaptor M_θ could be learned easily; in our unsupervised setting this becomes challenging. To maximally learn the adaptation ability from domain-adversarial training, M_θ is updated on pivot data P, a selected subset of both the source and target domains chosen to maximally exploit the domain-adversarial training. The pivot data are the samples with the highest confidence scores during the learning process, so they can serve as representatives of the domain-adversarial training. As the features become more domain-invariant, the classification performance on the target domain gradually improves, i.e., the pseudo labels for the target domain become more confident. Such pseudo-label training is widely adopted in the transfer learning literature [50, 37]. Therefore, based on the assumption that the adaptation ability can be learned from domain-adversarial training, task semantic adaptation is learned better if M_θ learns directly from the samples with the most confident pseudo labels, i.e., the pivot data. The number of pivot samples matters for our problem: fewer pivot samples bring higher confidence but weaker generalization, while more pivot samples do the opposite; this is empirically evaluated in later experiments. More specifically, the pivot data can be represented as:

P = ∪_{c ∈ Y} { (x_j, ŷ_j) | ŷ_j = c, j = 1, …, m },    (6)

where the data pairs {(x_j, ŷ_j)} are sorted in decreasing order of prediction score. Here ŷ is the predicted (pseudo) label on the target domain and c is the class index. We select the top m instances of each class with the highest prediction scores (softmax probabilities, which do not require the target labels). This selection is repeated throughout the learning process. For the source domain, we directly use the ground-truth labels. In total, we select m · |Y| pivot samples for each domain.

Feature-critic training. Unlike classification, there is no supervision information for the target domain, which makes it hard to update M_θ.
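Before describing the update rule for θ, here is a minimal sketch of Eqs. (5) and (6): the pairwise-distance ("Gram") input to M_θ and the top-m pivot selection. The use of torch.cdist, the fixed batch shapes, and the tensor layout are our assumptions:

```python
import torch

def task_semantic_loss(task_adaptor, feat_s, feat_t):
    """Eq. (5): M_theta applied to flattened pairwise source-target distances.
    Batch sizes must be fixed so the flattened size matches the MLP input."""
    gram = torch.cdist(feat_s, feat_t)                 # (n_s, n_t) distances
    return task_adaptor(gram.flatten().unsqueeze(0)).squeeze()

def select_pivot_data(logits_t, x_t, m=8):
    """Eq. (6): top-m most confident target samples per pseudo-class."""
    probs = logits_t.softmax(dim=1)
    conf, pseudo = probs.max(dim=1)                    # confidence, pseudo label
    pivots = []
    for c in range(probs.shape[1]):
        idx = (pseudo == c).nonzero(as_tuple=True)[0]
        top = idx[conf[idx].argsort(descending=True)[:m]]
        pivots.append((x_t[top], pseudo[top]))
    return pivots
```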
In this paper, we adopt a feature-critic training strategy [16] to update the task adaptor. For notational brevity, we pack Φ = {φ, ψ, ω} for all parameters other than θ. Our feature-critic training works as follows. Let Φ(t) and Φ(t+1) denote the parameter values at two consecutive learning steps t and t+1, respectively. Our key assumption is that, as the pseudo labels on the pivot data become more confident, if Φ(t+1) is better than Φ(t) for task adaptation, it should produce lower risk and better classification performance. Therefore, a reasonable feature-critic metric M_θ should assign a lower value to Φ(t+1) than to Φ(t). We thus update the feature-critic metric M_θ by minimizing the difference L_val computed from Φ(t) and Φ(t+1):

L_val = σ( L_task(Φ(t+1)) − L_task(Φ(t)) ),    (7)

where σ(·) is an activation function.

The training process consists of two steps: 1) update Φ for the feature extractor, classification layer, and domain discriminator, and 2) update θ for the task semantic adaptor. A condensed code sketch of this two-step procedure is given at the end of this section.

Update Φ. This step updates Φ for classification and domain-adversarial training. To enable the update of θ in the next step, we construct an assist model, which is a copy of the main model. The overall procedure is summarized in Algorithm 1:

Algorithm 1: Training process of TAN.
    Build an assist model with its parameters inherited from the main model.
    for mini-batch data B_s, B_t in S, T do
        Select the data with the highest prediction confidence from T to construct pivot data P.
        Update Φ by Eq. (8).
        Update θ by Eq. (9).
    end for

We compute the training loss in Eq. (1) on all the labeled source domain and unlabeled target domain data. Note that to update the domain discriminator D_ω, we do not use a min-max optimization procedure; instead, following [11], we use the Gradient Reversal Layer for computational efficiency. Therefore, ω can be updated together with φ and ψ in a single back-propagation. Given the training loss and denoting by α the learning rate of the main model, Φ can be updated by:

Φ(t+1) = Φ(t) − α ∇_Φ (L_cls + λ L_feat + μ L_task).    (8)

Update θ. This step updates θ for the task adaptor M_θ using feature-critic training on the pivot data P. Denoting by β its learning rate, θ can be updated by taking the derivative of L_val w.r.t. θ:

θ(t+1) = θ(t) − β ∇_θ L_val(Φ(t), Φ(t+1)),    (9)

where Φ(t) and Φ(t+1) are the parameters of the assist and main model, respectively. The two steps are applied iteratively, so the pseudo labels of the pivot data become more confident and all the losses can be minimized. In our experiments, we observe that the network converges within dozens of epochs. The training process of TAN is listed in Algorithm 1 and illustrated in Figure 3. For inference, we fix Φ and perform a single forward pass to obtain the classification results for the test data.

We evaluate TAN on public COVID-19 chest X-ray datasets. COVID-DA [51] contains three categories: typical pneumonia, COVID-19, and normal.

Figure 3. The learning process of TAN is composed of two iterative steps: update Φ and update θ.

For COVID-DA, we follow [51] to construct the source, target, and validation domains. For the Bacterial and Viral datasets, we construct the source and target domains by taking all pneumonia samples (for the source) and all COVID-19 samples (for the target), respectively. The normal category is split evenly between the two domains. We further hold out 20% of the target domain for validation. Eventually, the source domain contains two classes, normal and pneumonia, while the target and validation datasets contain the normal and COVID-19 classes. In this setting, the source domain does not contain any COVID-19 samples, which makes our problem harder than traditional transfer learning and domain adaptation. The detailed domain split information is presented in the appendix.
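As referenced above, here is a condensed sketch of one TAN training iteration, combining Eqs. (8) and (9) and reusing the helper functions from the earlier sketches. The use of copy.deepcopy for the assist model and the tanh critic activation follow the paper; the rest of the plumbing is our assumption:

```python
import copy
import torch
import torch.nn.functional as F

def train_step(F_phi, C_psi, D_omega, M_theta, x_s, y_s, x_t,
               opt_main, opt_meta, lam=1.0, mu=1.0):
    """One iteration. opt_main holds {phi, psi, omega}; opt_meta holds theta.
    For the theta update, x_s / x_t should be drawn from the pivot data P."""
    # Assist model: a frozen copy of the feature extractor at Phi(t).
    F_old = copy.deepcopy(F_phi).eval()

    # Step 1: update Phi with the total loss in Eq. (1)
    # (the GRL inside feature_adaptation_loss handles the adversarial part).
    feat_s, feat_t = F_phi(x_s), F_phi(x_t)
    loss = (F.cross_entropy(C_psi(feat_s), y_s)
            + lam * feature_adaptation_loss(D_omega, feat_s, feat_t)
            + mu * task_semantic_loss(M_theta, feat_s, feat_t))
    opt_main.zero_grad()
    loss.backward()
    opt_main.step()

    # Step 2: update theta so that M_theta scores Phi(t+1) lower than Phi(t),
    # i.e., minimize L_val = tanh(L_task(Phi(t+1)) - L_task(Phi(t))).
    with torch.no_grad():
        old_s, old_t = F_old(x_s), F_old(x_t)
    new_s, new_t = F_phi(x_s).detach(), F_phi(x_t).detach()
    l_val = torch.tanh(task_semantic_loss(M_theta, new_s, new_t)
                       - task_semantic_loss(M_theta, old_s, old_t))
    opt_meta.zero_grad()
    l_val.backward()
    opt_meta.step()
    return loss.item(), l_val.item()
```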
We compare the performance of TAN with three categories of methods: (1) deep and traditional transfer learning baselines, (2) deep diagnostic methods, and (3) unsupervised DA methods. The deep and traditional TL baselines include: Pretrain-only, which trains a network on the source domain and then directly applies the pretrained model to the target domain. Target-train, which is an ideal reference included only for comparison, since the target domain has no labels; we use extra labeled COVID-19 data from the dataset (30% of the target domain data), train a network on them, and then predict on the target data. Pretrain-finetune, which is the standard TL paradigm. The unsupervised DA baselines include DANN [11], MCD [29], CDAN+TransNorm [43], MDD [50], and BNM [7]. All methods use ResNet-18 [15] as the backbone network, following [51]. The results of these methods are taken from [51] to ensure a fair comparison. Note that we do not compare with [51] itself, since it is a semi-supervised method that requires labeled data in the target domain.

For TAN, we use mini-batch SGD with Nesterov momentum of 0.9 to optimize the main network and the meta-network, with batch size 16. The learning rate α of the main model follows the schedule of [11]: α_k = α (1 + γk)^(−υ), where k is the training iteration, γ = 0.001, α = 0.004, and the decay rate υ = 0.75. The learning rate β for M_θ is set to 0.0005. M_θ uses an in−128−64−1 MLP structure, where in is the dimension of the input matching features. We grid-search the values of λ and μ in the range [0.01, 0.05, 0.1, 0.5, 1, 5, 10] for the best performance. During training, the labels of the target domain are unavailable and are used only for evaluation. For this binary classification problem, we use F1, Precision (P), and Recall (R) as the evaluation metrics. We do not use ROC/AUC since, in this specific disease diagnosis problem, we are more interested in recall and F1. The reported results are averaged over ten trials.

The results on the COVID-DA dataset are presented in Table 2. Here we use the 95% confidence interval, for which the corresponding value of z is 1.96; the computed confidence interval r is around 1.3%. We do not report accuracy, since all methods achieve similarly high accuracy in this binary classification problem. On COVID-DA, our proposed TAN achieves a recall of 80.0% and an F1 of 75.0%, significantly outperforming the second best baseline by 5.0% in recall and 7.8% in F1 score. Pretrain-only and Pretrain-finetune do not perform well, because the feature distribution divergence and task semantic difference drastically limit them. Compared to the other DA methods (DANN, MCD, CDAN, MDD, and BNM), TAN also achieves better recall and F1 score. While the precision of CDAN is better than ours, its recall and F1 are not competitive. Compared to the ideal reference of training on labeled target data, it is surprising that, even in this fully unsupervised setting, our proposed TAN achieves better recall and F1 score.

The results on the Bacterial and Viral datasets are shown in Table 3 and Table 4, respectively. On these two datasets, we do not compare with the two ideal references, target-train and pretrain-finetune, since their performance is consistently better due to the increased number of COVID-19 samples in these datasets. Our TAN still significantly outperforms the other baselines on these larger datasets.
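For completeness, a small sketch of the learning-rate schedule and the evaluation protocol described above; the sklearn call and the confidence-interval computation (z = 1.96 over ten trial scores) are our reading of the setup, not code released with the paper:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def lr_at_iteration(k, alpha=0.004, gamma=0.001, upsilon=0.75):
    """alpha_k = alpha * (1 + gamma * k) ** (-upsilon), following [11]."""
    return alpha * (1.0 + gamma * k) ** (-upsilon)

def evaluate(y_true, y_pred, trial_scores=None):
    """Precision/recall/F1 for the positive (COVID-19) class, plus an
    optional 95% confidence interval over repeated trials."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1)
    out = {"precision": p, "recall": r, "f1": f1}
    if trial_scores is not None:
        s = np.asarray(trial_scores, dtype=float)
        out["ci95"] = 1.96 * s.std(ddof=1) / np.sqrt(len(s))
    return out
```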
As on the COVID-DA dataset, pretrain-only gives the worst results due to the feature distribution divergence and task semantic difference. While the domain adaptation methods (DAN, DANN, MDD, and BNM) outperform pretrain-only by aligning the feature distributions, they do not adapt the task semantics. Comparing the results across all three tables, we see that as the number of unlabeled target domain samples increases, the transfer learning performance tends to improve, i.e., the results in Table 3 and Table 4 are generally better than those in Table 2. This is because more representative knowledge can be learned when the domains have more samples.

To further evaluate the effectiveness of TAN, we conduct an ablation study in Table 5. For the task-adaptation-only variant, we remove the domain-adversarial training module and let the model learn from the classification loss directly. For the feature-adaptation-only variant, we simply remove the task adaptation module. The results support the following observations. First, the classification loss alone does not achieve good results, indicating the existence of feature distribution divergence and task semantic difference. Second, adding either the feature distribution adaptation module or the task semantic adaptation module improves performance, indicating that both adaptation modules are effective. Third, the best performance is achieved by combining the feature adaptation and task semantic adaptation modules, which shows that both are important for this problem.

Changes in the lung are a critical factor for diagnosing COVID-19 and can be visualized to study the effectiveness of our method. We therefore visualize the attention maps for several COVID-19 images using Gradient-weighted Class Activation Mapping (Grad-CAM) [32] in Figure 4. The shadowed area in the figures is the lung region, and the heat map denotes the activation weights of the model. Specifically, Figure 4(a) shows cases where all adaptation variants give correct predictions, while Figure 4(b) shows cases where wrong predictions are given by feature distribution adaptation alone and by task semantic adaptation alone, and correct predictions are obtained by adapting both the feature distributions and the task semantics. We observe that for easy samples, all adaptation variants give correct predictions. However, for hard samples, such as the second one in Figure 4(b), which is taken sideways, classification becomes more difficult; in such cases, adapting only the feature distributions or only the task semantics is not sufficient. Our proposed TAN performs reasonably well in all situations.

Pivot data. We empirically analyze the size m of the pivot data P. A larger m brings more uncertainty, while a smaller m is likely to make the meta-network unstable. We record the performance of TAN using different values of m on the COVID-DA dataset in Figure 5(a). The results indicate that TAN is robust to m and that a small m already leads to competitive performance. We therefore set m = 8 in the experiments for computational efficiency. We also compare different pivot data selection strategies in the appendix.

Permutation invariance. We evaluate the permutation invariance property of M_θ. Figure 5(b) shows two random runs using the Gram matrix as input (i.e., the permutation-invariant case) and the raw features (i.e., the permutation-variant case, denoted as 'plain' in the figure).
While both variants outperform the other baselines in Table 2, the Gram-matrix input achieves the best performance, indicating that permutation invariance is important for the task semantic adaptor.

Feature distribution. We compute the feature distribution distance using the maximum mean discrepancy (MMD) [3], as shown in Figure 5(c). The results indicate that our proposed TAN gradually reduces the distribution distance.

Convergence and parameter sensitivity analysis. We record the classification and task semantic adaptation losses in Figure 5(d). The results show that although TAN trains both the classification and task adaptation networks, it quickly reaches a steady performance, which makes it easy to train in real applications. We also empirically analyze the sensitivity of the two trade-off parameters λ and μ and present the results in the appendix; they show that our method is relatively robust to these parameters.

Although our main focus is COVID-19, TAN is not limited to this problem. In fact, TAN can be applied to any dataset with a similar setting, as well as to standard DA datasets.

Constructed dataset. We construct a new dataset from the VisDA-2017 DA challenge [28] by selecting two of its classes at random. In this dataset, the source domain consists of rendered 3D objects and the target domain consists of natural images. We call this constructed dataset VisDA-binary. Specifically, the source domain contains two classes, train (16,000) and truck (9,600), and the target domain contains train (4,236) and bus (4,690). The goal is to maximize the binary classification performance on the target domain, especially on the class bus. This dataset is more balanced and has more samples than the COVID-19 datasets, so it can be regarded as their complement. The results in Table 6 show that TAN outperforms all other comparison methods in recall and F1 score.

Standard DA benchmark. We also evaluate the performance of TAN on several standard domain adaptation benchmarks. Table 7 shows the results of TAN against other strong baselines on the ImageCLEF-DA [1] dataset. Although TAN is not specifically designed for traditional DA tasks, it still achieves competitive performance. We report the results on the Office-Home [40] dataset in the appendix, where TAN also produces competitive performance.

In this paper, we propose a Task Adaptation Network (TAN) for COVID-19 chest X-ray image classification by transferring knowledge from typical pneumonia. TAN adapts both the cross-domain feature distributions and the task semantics to produce accurate predictions on the target domain. For the task semantic adaptation, which is hard to model, we design a semantic adaptor that leverages the learning-to-learn strategy to learn the adaptation ability from domain-adversarial training. Experiments on several public datasets show that TAN significantly outperforms other comparison approaches. Moreover, TAN achieves competitive performance on several domain adaptation benchmarks. In the future, we plan to apply TAN to more fine-grained COVID-19 diagnosis tasks such as detection and segmentation. TAN can also be applied to other COVID-19 data modalities such as CT scans. In addition, we plan to apply TAN to other similar transfer learning problems.

In our main experiments, we set m = 8 for the pivot data, selecting for each class the top m samples with the highest probabilities.
For pivot data construction, we further evaluate two other strategies: (1) selecting m random samples for each class and (2) selecting the bottom m samples for each class. Here, 'bottom m' is the opposite of top m in the main paper, i.e., selecting the m samples with the lowest probabilities. The results in Figure 6 show that the top-m strategy achieves the best performance.

Beyond precision, recall, and F1 score, we also draw the ROC curve in Figure 7. The results show that our proposed TAN achieves superior performance on this binary classification problem. We also compute the AUC (Area Under Curve) of our method and compare it with the other baselines in Figure 8. The results show that, apart from target-train, which is the ideal reference that uses the target domain labels for training, our method achieves the best AUC compared to all other baselines.

In this section, we pay special attention to the design criteria of the task semantic adaptation network M_θ. We extensively analyze the following aspects: 1) network structure, 2) training criterion, and 3) activation function. A thorough analysis of these properties is valuable for designing better networks and learning strategies for similar problems.

We design different structures for M_θ and record their performance in Table 9. Different structures produce different results, and all of them are better than the comparison methods in Table 2 of the main paper, which means that the MLP structure of M_θ is effective. In general, a complex structure is better at learning meaningful representations, whereas a simple structure may be worse at feature learning but less prone to overfitting. Based on our experiments, we choose the in−128−64−1 structure for M_θ.

With regard to the training criterion, we try different learning rates for the task semantic adaptor M_θ in [0.0001, 0.0005, 0.001, 0.005, 0.01] and record the performance in Figure 9. The results indicate that M_θ achieves similar performance across learning rates; to obtain the best performance we use the learning rate 0.0005.

We show the performance of different activation functions (i.e., σ in Eq. (7) of the main paper) in Table 10. The results show that tanh generally leads to better performance for our problem, while sigmoid also achieves good results. We therefore use tanh in this work.

For the Bacterial and Viral datasets, the detailed source and target domain splits are given in Table 11. The split information of COVID-DA is omitted; it can be found in [51]. Clearly, the COVID-19 samples are not seen during training, which makes our unsupervised task transfer problem truly challenging. We also note that although the number of viral pneumonia samples is smaller than the number of bacterial pneumonia samples, the transfer learning performance on the Viral dataset is almost the same as on the Bacterial dataset (cf. Tables 3 and 4 in the main paper, 95.2 vs. 95.4 F1 score). This may be due to the high similarity between viral pneumonia and COVID-19 pneumonia, as COVID-19 is also caused by a certain type of virus, "SARS-CoV-2". Again, this reflects the fact that similarity matters in transfer learning.

The results on the Office-Home [40] dataset are shown in Table 8, and the results on the VisDA-2017 dataset using all classes are in Table 12.
Note that TAN does not focus on standard domain adaptation tasks; thus, we do not compare it with the latest DA methods. These results show that although the proposed TAN is not tailored for traditional domain adaptation tasks, it still achieves competitive results compared to several strong baselines. In the future, it is possible to develop new TAN-based methods for traditional DA tasks to further improve performance.

Table 12. Accuracy (%) on the VisDA-2017 dataset (ResNet-50).
    Method         synth → real
    ResNet [15]    52.4
    DAN [18]       61.1
    DANN [11]      57.4
    JAN [21]       61.6
    TAN (Ours)     64.0

We show in Figure 10 that our method is relatively robust to different values of the two parameters λ and μ, making it easily applicable to real applications.

References
[1] ImageCLEF-DA. http://imageclef.org/2014/adaptation
[2] On the optimization of a synaptic learning rule
[3] Integrating structured biological data by kernel maximum mean discrepancy
[4] Unsupervised pixel-level domain adaptation with generative adversarial networks
[5] Multitask learning
[6] Approximation with artificial neural networks
[7] Towards discriminability and diversity: Batch nuclear-norm maximization under label insufficient situations
[8] ImageNet: A large-scale hierarchical image database
[9] DeCAF: A deep convolutional activation feature for generic visual recognition
[10] Model-agnostic meta-learning for fast adaptation of deep networks
[11] Unsupervised domain adaptation by backpropagation
[12] Domain-adversarial training of neural networks
[13] Generative adversarial nets
[14] A kernel two-sample test
[15] Deep residual learning for image recognition
[16] Feature-critic networks for heterogeneous domain generalization
[17] End-to-end multi-task learning with attention
[18] Learning transferable features with deep adaptation networks
[19] Conditional adversarial domain adaptation
[20] Deep transfer learning with joint adaptation networks
[21] Deep transfer learning with joint adaptation networks
[22] Label efficient learning of transferable representations across domains and tasks
[23] What is being transferred in transfer learning?
[24] Zero-shot task transfer
[25] A survey on transfer learning
[26] Open set domain adaptation
[27] Multi-adversarial domain adaptation
[28] VisDA: The visual domain adaptation challenge
[29] Maximum classifier discrepancy for unsupervised domain adaptation
[30] Open set domain adaptation by backpropagation
[31] Meta-learning with memory-augmented neural networks
[32] Grad-CAM: Visual explanations from deep networks via gradient-based localization
[33] Multi-task learning as multi-objective optimization
[34] Zero-shot learning through cross-modal transfer
[35] Deep CORAL: Correlation alignment for deep domain adaptation
[36] Meta-transfer learning for few-shot learning
[37] Adversarial discriminative domain adaptation
[38] Deep domain confusion: Maximizing for domain invariance
[39] Curated dataset for COVID-19 posterior-anterior chest radiography images (X-rays)
[40] Deep hashing network for unsupervised domain adaptation
[41] Visual domain adaptation with manifold embedded distribution alignment
[42] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest radiography images
[43] Transferable normalization: Towards improving transferability of deep neural networks
[44] How transferable are features in deep neural networks?
[45] Taskonomy: Disentangling task transfer learning
[46] Central moment discrepancy (CMD) for domain-invariant representation learning
[47] Importance weighted adversarial nets for partial domain adaptation
[48] COVID-19 screening on chest X-ray images using deep learning based anomaly detection
[49] Collaborative and adversarial network for unsupervised domain adaptation
[50] Bridging theory and algorithm for domain adaptation
[51] Deep domain adaptation from typical pneumonia to COVID-19