RGB-D salient object detection: A survey
Tao Zhou, Deng-Ping Fan, Ming-Ming Cheng, Jianbing Shen, Ling Shao
Comput Vis Media (Beijing), 2021-01-07. DOI: 10.1007/s41095-020-0199-z

Salient object detection, which simulates human visual perception in locating the most significant object(s) in a scene, has been widely applied to various computer vision tasks. Now, the advent of depth sensors means that depth maps can easily be captured; this additional spatial information can boost the performance of salient object detection. Although various RGB-D based salient object detection models with promising performance have been proposed over the past several years, an in-depth understanding of these models and the challenges in this field remains lacking. In this paper, we provide a comprehensive survey of RGB-D based salient object detection models from various perspectives, and review related benchmark datasets in detail. Further, as light fields can also provide depth maps, we review salient object detection models and popular benchmark datasets from this domain too. Moreover, to investigate the ability of existing models to detect salient objects, we have carried out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models. Finally, we discuss several challenges and open directions of RGB-D based salient object detection for future research. All collected models, benchmark datasets, datasets constructed for attribute-based evaluation, and related code are publicly available at https://github.com/taozh2017/RGBD-SODsurvey.
Salient object detection aims to locate the most visually prominent object(s) in a given scene [1]. It plays a key role in a range of real-world applications, such as stereo matching [2], image understanding [3], co-saliency detection [4], action recognition [5], video detection and segmentation [6-9], semantic segmentation [10, 11], medical image segmentation [12-14], object tracking [15, 16], person re-identification [17, 18], camouflaged object detection [19], image retrieval [20], etc. Although significant progress has been made in the salient object detection field over the past several years [21-35], there is still room for improvement when faced with challenging factors, such as complicated backgrounds or varying lighting conditions in the scenes. One way to overcome such challenges is to employ depth maps, which provide complementary spatial information to that from RGB images and have become easier to capture due to the ready availability of depth sensors (e.g., Microsoft Kinect). Recently, RGB-D based salient object detection has gained increasing attention, and various methods have been developed [38, 45]. Early RGB-D based salient object detection models tended to extract handcrafted features and then fuse the RGB image and depth map. For example, Lang et al. [46] produced the first work on RGB-D based salient object detection, utilizing Gaussian mixture models to model the distribution of depth-induced saliency. Ciptadi et al. [47] extracted 3D layout and shape features from depth measurements. Several methods [48-50] measure depth contrast using depth differences between different regions. In Ref.
[51], a multi-contextual contrast model including local, global, and background contrast was developed to detect salient objects using depth maps. More importantly, however, this work also provided the first large-scale RGB-D dataset for salient object detection. Despite the effectiveness of traditional methods using handcrafted features, their low-level features tend to limit generalization ability, and they lack the high-level reasoning required for complex scenes. To address these limitations, several deep learning-based RGB-D salient object detection methods [38] have been developed, with improved performance. DF [52] was the first model to introduce deep learning technology into the RGB-D based salient object detection task. More recently, various deep learning-based models [41-44, 53-55] have focused on exploiting effective multi-modal correlations and multi-scale or multi-level information to boost salient object detection performance. To more clearly describe the progress in the RGB-D based salient object detection field, we provide a brief chronology in Fig. 2 (beginning with the earliest work [46]; deep learning techniques have been widely applied in this field since 2017; see Section 2). In this paper, we provide a comprehensive survey of RGB-D based salient object detection, aiming to thoroughly cover various aspects of models used for this task and to provide insightful discussions of the challenges and open directions for future work. We also review a related topic, light field salient object detection, as light fields can also provide additional information (including focal stacks, all-focus images, and depth maps) to boost the performance of salient object detection. Further, we provide a comprehensive comparative evaluation of existing RGB-D based salient object detection models and discuss their main advantages. Several surveys consider salient object detection. For example, Borji et al. [59] provided a quantitative evaluation of 35 state-of-the-art non-deep-learning saliency detection methods. Cong et al. [60] reviewed several different saliency detection models, including RGB-D based salient object detection, co-saliency detection, and video salient object detection. Zhang et al. [61] provided an overview of co-saliency detection, reviewed its history, and summarized several benchmark algorithms in this field. Han et al. [62] reviewed recent progress in salient object detection, including models, benchmark datasets, and evaluation metrics, as well as discussing the underlying connection between general object detection, salient object detection, and category-specific object detection. Nguyen et al. [63] reviewed various works related to saliency applications and provided insightful discussions of the role of saliency in each. Borji et al. [64] provided a comprehensive review of recent progress in salient object detection and discussed related topics, including generic scene segmentation, saliency for fixation prediction, and object proposal generation. Fan et al. [1] provided a comprehensive evaluation of several state-of-the-art CNN-based salient object detection models, and proposed a high-quality salient object detection dataset, SOC (see: http://dpfan.net/socbenchmark/). Zhao et al. [65] reviewed various deep learning-based object detection models and algorithms in detail, as well as various specific tasks, including salient object detection.
Wang et al. [66] focused on reviewing deep learning-based salient object detection models. Unlike previous salient object detection surveys, in this paper, we focus on reviewing RGB-D based salient object detection models and benchmark datasets. Our contributions and organization are:
• the first systematic review of RGB-D based salient object detection models considering different perspectives. We classify existing RGB-D salient object detection models as traditional or deep methods, fusion-wise methods, single-stream or multi-stream methods, and attention-aware methods (Section 2);
• a review of nine RGB-D datasets commonly used in this field, giving details for each (Section 3). We also provide a comprehensive, attribute-based evaluation of several representative RGB-D based salient object detection models (Section 5);
• the first survey of light field salient object detection models and benchmark datasets (Section 4);
• a thorough investigation of challenges facing RGB-D based salient object detection, and the relationship between salient object detection and other topics, shedding light on potential directions for future research (Section 6).
Conclusions are drawn in Section 7.
Over the past few years, several RGB-D based salient object detection methods have been developed; they provide promising performance. These models are summarized in Tables 1-4. Further information can be found at http://dpfan.net/d3netbenchmark/. To review these RGB-D based salient object detection models, we consider them from several perspectives: traditional versus deep methods, fusion strategies, single-stream versus multi-stream architectures, and attention-aware methods. Using depth cues, several useful attributes, such as boundaries, shape attributes, and surface normals, can be extracted and exploited by traditional models built on handcrafted features. However, such traditional methods suffer from unsatisfactory salient object detection performance due to the limited expressiveness of handcrafted features. To address this, several studies have turned to deep neural networks (DNNs) to fuse RGB-D data [39, 40, 42-44, 52-55, 83, 93, 94, 96, 102-106, 111-113, 117-119, 137]. These models can learn high-level representations to explore complex correlations between RGB images and depth cues for improving salient object detection performance. We next review some representative works. DF [52] develops a novel convolutional neural network (CNN) to integrate different low-level saliency cues into hierarchical features, to effectively locate salient regions in RGB-D images. This was the first CNN-based model for RGB-D salient object detection. However, it utilizes a shallow architecture to learn the saliency map. PCF [92] presents a complementarity-aware fusion module to integrate cross-modal and cross-level feature representations. It can effectively exploit complementary information by explicitly using cross-modal and cross-level connections and modal-wise and level-wise supervision to decrease fusion ambiguity. CTMF [58] employs a computational model to identify salient objects from RGB-D scenes, utilizing CNNs to learn high-level representations for RGB images and depth cues, while simultaneously exploiting their complementary relationships and joint representation. This model transfers the structure of the network from the source domain (RGB images) to the target domain (depth maps). CPFP [53] proposes a contrast-enhanced network to produce an enhanced map, and presents a fluid pyramidal integration module to effectively fuse cross-modal information in a hierarchical manner. As depth cues tend to suffer from noise, a feature-enhanced module is used to learn enhanced depth cues to effectively boost salient object detection performance.
UC-Net [44] proposes a probabilistic RGB-D based salient object detection network via conditional variational autoencoders to model human annotation uncertainty. It generates multiple saliency maps for each input image by sampling the learned latent space. This was the first work to investigate uncertainty in RGB-D based salient object detection, and was inspired by the data labeling process. It leverages diverse saliency maps to improve the final salient object detection performance. For RGB-D based salient object detection models, it is important to effectively fuse RGB images and depth maps. Existing fusion strategies can be classified as using early fusion, multi-scale fusion, or late fusion, as we now explain; also see Fig. 3. Early fusion-based methods work in one of two ways: (i) RGB images and depth maps are directly integrated to form a four-channel input [50, 51, 87, 96], which we call input fusion, or (ii) RGB and depth images are first fed into separate networks and their low-level representations are combined to give a joint representation, which is then fed into a subsequent network for further saliency map prediction [52]. We call this early feature fusion. Late fusion-based methods can also be further divided into two families: (i) two parallel network streams are adopted to learn high-level features for RGB and depth data, respectively, which are concatenated and then used to generate the final saliency prediction [48, 58, 106]. We call this late feature fusion. (ii) Two parallel network streams are used to obtain independent saliency maps for RGB images and depth cues, and the two saliency maps are then combined to obtain a final prediction map [108]. This is called late result fusion.
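To make the first two families concrete, the minimal PyTorch sketch below contrasts input fusion with late feature fusion. It is only an illustration under simple assumptions: the module names, layer widths, and backbone depths are invented for this example and do not correspond to any of the models cited above.

# Minimal PyTorch sketch (illustrative only) of input fusion vs. late feature fusion.
import torch
import torch.nn as nn

class InputFusionSOD(nn.Module):
    # Early (input) fusion: the depth map is stacked with RGB as a 4-channel input.
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(64, 1, 1)  # per-pixel saliency logits

    def forward(self, rgb, depth):  # rgb: B x 3 x H x W, depth: B x 1 x H x W
        return self.head(self.backbone(torch.cat([rgb, depth], dim=1)))

class LateFeatureFusionSOD(nn.Module):
    # Late feature fusion: two streams learn high-level features that are only
    # concatenated before the prediction head.
    def __init__(self):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
        self.rgb_stream, self.depth_stream = stream(3), stream(1)
        self.head = nn.Conv2d(256, 1, 1)

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        return self.head(fused)

Late result fusion would instead run two complete detectors and combine their saliency maps, while multi-scale fusion (discussed next) exchanges features at several intermediate layers rather than at a single point.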
To effectively explore the correlations between RGB images and depth maps, several methods propose a multi-scale fusion strategy [42, 43, 55, 109, 116, 122, 123, 128]. These models can be divided into two categories. The first learns cross-modal interactions and then fuses them into a feature learning network. For example, Chen et al. [55] developed a multi-scale, multi-path fusion network with cross-modal interactions (MMCI) to integrate RGB images and depth maps. This method introduces cross-modal interactions into multiple layers, which can provide additional gradients for enhancing learning of the depth stream, as well as enabling the complementarity between low-level and high-level representations to be explored. The second category fuses features from RGB images and depth maps in different layers and then integrates them into a decoder network (e.g., via skip connections) to produce the final saliency detection map. Some representative works are now briefly discussed. ICNet [42] proposes an information conversion module to interactively convert high-level features. In this model, a cross-modal depth-weighted combination (CDC) block is introduced to enhance RGB features with depth features at different levels. DPANet [109] uses a gated multi-modality attention (GMA) module to exploit long-range dependencies. The GMA module can extract the most discriminative features by utilizing a spatial attention mechanism. This model also controls the fusion rate of the cross-modal information using a gate function, which can reduce some effects caused by unreliable depth cues. BiANet [116] employs a multi-scale bilateral attention module (MBAM) to capture better global information from multiple layers. JL-DCF [43] treats a depth image as a special case of a color image and employs a shared CNN for both RGB and depth feature extraction. It also proposes a densely-cooperative fusion strategy to effectively combine the features learned from the different modalities. BBS-Net [128] uses a bifurcated backbone strategy (BBS) to split the multi-level feature representations into teacher and student features, and develops a depth-enhanced module (DEM) to explore informative parts of depth maps from the spatial and channel views. Several RGB-D based salient object detection works [52, 53, 83, 87, 93, 96, 102] focus on a single-stream architecture to achieve saliency prediction. These models often fuse RGB images and depth information in the input channel or feature learning part. For example, MDSF [87] employs a multi-scale discriminative saliency fusion framework as the salient object detection model, in which four types of features from three levels are computed and then fused to obtain the final saliency map. BED [83] utilizes a CNN architecture to integrate bottom-up and top-down information for salient object detection. It incorporates multiple features, including background enclosure distribution (BED) and low-level depth cues (e.g., depth histogram distance and depth contrast), to boost salient object detection performance. PDNet [102] extracts depth-based features using a subsidiary network, which makes full use of depth information to assist the main-stream network. Two-stream models [54, 106, 111] have two independent branches to process RGB images and depth cues, respectively; they often generate separate high-level features or saliency maps and then combine them at an intermediate stage or at the end of the two streams. Most recent deep learning-based models [40, 42, 45, 55, 92, 104, 109, 112, 114, 117] utilize this two-stream architecture, with several models capturing the correlations between RGB images and depth cues across multiple layers. Moreover, some models utilize a multi-stream structure [38, 103] and design different fusion modules to effectively fuse RGB and depth information in order to exploit their correlations. Existing RGB-D based salient object detection methods often treat all regions equally, using the extracted features in the same way and ignoring the fact that different regions can make different contributions to the final prediction map. Such methods are easily affected by cluttered backgrounds. Furthermore, some methods either regard the RGB images and depth maps as having the same status or overly rely on depth information, which prevents them from considering the relative importance of the different domains (RGB images or depth cues). To overcome such issues, several methods introduce attention mechanisms to weight the importance of different regions or domains. ASIF-Net [117] captures complementary information from RGB images and depth cues using interwoven fusion, and weights saliency regions through a deeply supervised attention mechanism. AttNet [111] introduces attention maps for differentiating between salient objects and background regions to reduce the negative influence of certain low-quality depth cues. TANet [103] formulates a multi-modal fusion framework using RGB images and depth maps from bottom-up and top-down views. It then introduces a channel-wise attention module to effectively fuse the complementary information from different modalities and levels.
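As a simple illustration of how channel-wise attention can weight the two modalities during fusion, the sketch below gates the concatenated RGB and depth features with learned per-channel weights. It is a generic squeeze-and-excitation-style block written for this survey, not the exact module of TANet or any other cited model.

# Illustrative channel-attention fusion block; names and sizes are assumptions.
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Squeeze-and-excitation style gating over the concatenated modalities.
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, 2 * channels), nn.Sigmoid())
        self.reduce = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_rgb, f_depth):            # both: B x C x H x W
        f = torch.cat([f_rgb, f_depth], dim=1)    # B x 2C x H x W
        w = self.fc(f.mean(dim=(2, 3)))           # per-channel weights, B x 2C
        f = f * w.unsqueeze(-1).unsqueeze(-1)     # re-weight RGB and depth channels
        return self.reduce(f)                     # fused feature, B x C x H x W

The learned weights let the network lean on the RGB channels when the depth channels are uninformative, and vice versa, rather than treating the two domains as having the same status.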
Available open-source implementations of the RGB-D based salient object detection models reviewed in this survey are provided in Table 5. Further source code will be added as it becomes available. With the rapid development of RGB-D based salient object detection, various datasets have been constructed over the past several years. Table 6 summarizes nine popular RGB-D datasets, and Fig. 4 shows examples of images (including RGB images, depth maps, and annotations) from these datasets. We provide details for each dataset next. STERE [139]. The authors collected 1250 stereoscopic images from Flickr (http://www.flickr.com/), NVIDIA 3D Vision Live (http://photos.3dvisionlive.com/), and the Stereoscopic Image Gallery (http://www.stereophotography.com/). The most salient objects in each image were annotated by three users. All annotated images were then sorted based on the overlapping salient regions and the top 1000 images were selected to construct the final dataset. This was the first collection of stereoscopic images in this field. GIT [47] consists of 80 color and depth images, collected using a mobile-manipulator robot in a real-world home environment. Each image is annotated based on pixel-level segmentation of its objects. DES [49] consists of 135 indoor RGB-D images, taken by Kinect at a resolution of 640 × 640. When collecting this dataset, three users were asked to label the salient object in each image, and overlapping labeled areas were regarded as the ground truth. NLPR [51] consists of 1000 RGB images and corresponding depth maps, obtained by a standard Microsoft Kinect. This dataset includes a series of outdoor and indoor locations, e.g., offices, supermarkets, campuses, streets, and so on. LFSD [140] includes 100 light fields collected using a Lytro light field camera, and consists of 60 indoor and 40 outdoor scenes. To label this dataset, three individuals were asked to manually segment salient regions; the segmented results were deemed ground truth when the overlap of the three results was over 90%. NJUD [56] consists of 1985 stereo image pairs, collected from the Internet, 3D movies, and photographs taken by a Fuji W3 stereo camera. SSD [85] was constructed using three stereo movies and includes indoor and outdoor scenes. It includes 80 samples; each image has a resolution of 960 × 1080. DUT-RGBD [137] consists of 800 indoor and 400 outdoor scenes with corresponding depth images. This dataset provides several challenging factors: multiple and transparent objects, complex backgrounds, similar foregrounds to backgrounds, and low-intensity environments. SIP [38] consists of 929 annotated high-resolution images, with multiple salient persons in each image. In this dataset, depth maps were captured using a smartphone (Huawei Mate10). This dataset covers diverse scenes and various challenging factors, and is annotated with pixel-level ground truth. A detailed statistical analysis of the dataset (including center bias, size of objects, background objects, object boundary conditions, and number of salient objects) can be found in Ref. [38]. Salient object detection methods can be grouped into three categories according to the input data type: RGB, RGB-D, or light field [141]. We have already reviewed RGB-D based salient object detection models, in which depth maps provide geometric information to improve salient object detection performance to some extent. However, inaccurate or low-quality depth maps often decrease performance.
To overcome this issue, light field salient object detection methods have been proposed to make use of the rich information captured by a light field. Specifically, light field data can provide an all-focus image, a focal stack, and a rough depth map [137]. A summary of light field salient object detection works is provided in Table 7; we now review them in more detail. Classic models for light field salient object detection often use superpixel-level handcrafted features [137, 140, 142-147, 149, 155]. Early work [140, 147] showed that the unique refocusing capability of light fields can provide useful focus, depth, and object identity cues, leading to several salient object detection models using light field data. For example, Zhang et al. [143] utilized a set of focal slices to compute saliency cues. Several refinement strategies have been used to enforce neighborhood constraints or to reduce the homogeneity of multiple modalities for salient object detection. For example, in Ref. [142], the saliency dictionary was refined using an estimated saliency map. The MA method [145] employs a two-stage saliency refinement strategy to produce the final prediction map, so that adjacent superpixels obtain similar saliency values. LFNet [141] presents an effective refinement module to reduce the homogeneity between different modalities as well as to refine their dissimilarities. Five representative datasets are widely used in existing light field salient object detection methods, as we now describe. LFSD [140] consists of 100 light fields of different scenes with 360 × 360 spatial resolution, captured using a Lytro light field camera. This dataset contains 60 indoor and 40 outdoor scenes, and most scenes include only one salient object. Three individuals were asked to manually segment salient regions in each image, and ground truth was determined to occur when all three segmentation results had an overlap of over 90%. (https://sites.duke.edu/nianyi/publication/saliency-detection-on-light-field/) HFUT [145] consists of 255 light fields captured using a Lytro camera. Most scenes contain multiple objects at different locations and scales, with complex background clutter. (https://github.com/pencilzhang/HFUT-Lytro-dataset) DUTLF-FS [151] consists of 1465 samples, 1000 for use as a training set and 465 as a test set. The resolution of each image is 600 × 400. This dataset contains several challenges, including low contrast between salient objects and cluttered backgrounds, multiple disconnected salient objects, and dark and bright lighting conditions. (https://github.com/OIPLab-DUT/ICCV2019_Deeplightfield_Saliency) DUTLF-MV [152] consists of 1580 samples, 1100 for training and the remainder for testing. Images were captured by a Lytro Illum camera, and each light field consists of multi-view images and corresponding ground truth. (https://github.com/OIPLab-DUT/IJCAI2019-Deep-Light-Field-Driven-Saliency-Detection-from-A-Single-View) Lytro Illum [156] consists of 640 light fields and the corresponding per-pixel ground-truth saliency maps. It includes several challenging factors, e.g., inconsistent illumination conditions, and small salient objects in similar or cluttered backgrounds. (https://github.com/pencilzhang/MAC-light-field-saliency-net) We briefly review several popular metrics for salient object detection evaluation: precision-recall (PR), F-measure [59, 157], mean absolute error (MAE) [158], structural measure (S-measure) [159], and enhanced-alignment measure (E-measure) [160]. PR.
Given a saliency map S, we can convert it to a binary mask M, and then compute the precision P and recall R by comparing M with a ground-truth map G:
P = |M ∩ G| / |M|,   R = |M ∩ G| / |G|.
A popular strategy is to partition the saliency map S using a set of thresholds (from 0 to 255). For each threshold, we calculate a pair of precision and recall scores, and then combine them to obtain a PR curve that describes the performance of the model as the threshold varies. F-measure (F_β). The F-measure takes into account both precision and recall in a single measure, using the weighted harmonic mean:
F_β = ((1 + β²) · P · R) / (β² · P + R),
where β² is set to 0.3 to emphasize precision [157]. We may again vary the threshold and compute the F-measure, yielding a set of F-measure values, from which we report the maximal or average F_β. MAE. This measures the average pixel-wise absolute error between a saliency map S and a ground-truth map G over all pixels. It can be defined by
MAE = (1 / (W × H)) Σ_{x=1..W} Σ_{y=1..H} |S(x, y) − G(x, y)|,
where W and H denote the width and height of the map, respectively. MAE values are normalized to [0, 1]. S-measure (S_α). To capture the importance of the structural information in an image, S_α [159] is used to assess structural similarity, combining region-aware (S_r) and object-aware (S_o) perception. Thus, S_α can be defined by
S_α = α · S_o + (1 − α) · S_r,
where α ∈ [0, 1] is a weight. We set α = 0.5 as the default, as suggested by Fan et al. [159]. E-measure (E_φ). E_φ [160] was proposed based on cognitive vision studies to capture image-level statistics and local pixel matching information. Thus, E_φ can be defined by
E_φ = (1 / (W × H)) Σ_{x=1..W} Σ_{y=1..H} φ_FM(x, y),
where φ_FM denotes the enhanced-alignment matrix [160].
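For illustration, the short NumPy sketch below computes the pixel-wise metrics (PR, F-measure with β² = 0.3, and MAE) directly from the definitions above; S-measure and E-measure are omitted because their region- and object-level terms require longer code. The function names and the synthetic arrays are placeholders for this example, not part of any official evaluation toolbox.

# NumPy sketch of the pixel-wise metrics defined above (PR, F-measure, MAE).
import numpy as np

def precision_recall(S, G, threshold):
    # S: saliency map in [0, 1]; G: binary ground truth; both H x W arrays.
    M = S >= threshold                              # binarized prediction
    Gb = G >= 0.5
    tp = np.logical_and(M, Gb).sum()
    return tp / (M.sum() + 1e-8), tp / (Gb.sum() + 1e-8)

def f_measure(S, G, threshold, beta2=0.3):
    p, r = precision_recall(S, G, threshold)
    return (1 + beta2) * p * r / (beta2 * p + r + 1e-8)

def mae(S, G):
    return np.abs(S - G).mean()

# Sweep thresholds to build a PR curve and report the maximal F-measure.
S = np.random.rand(480, 640)                         # stand-in prediction
G = (np.random.rand(480, 640) > 0.5).astype(float)   # stand-in ground truth
thresholds = np.linspace(0, 1, 256)                  # analogue of the 0..255 sweep
pr_curve = [precision_recall(S, G, t) for t in thresholds]
max_f = max(f_measure(S, G, t) for t in thresholds)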
To quantify the performance of different models, we conducted a comprehensive evaluation of 24 representative RGB-D based salient object detection models, including nine traditional methods: LHM [51], ACSD [56], DESM [49], GP [50], LBE [57], DCMC [36], SE [37], CDCP [84], and CDB [95], and fifteen deep learning-based methods: DF [52], PCF [92], CTMF [58], CPFP [53], TANet [103], AFNet [106], MMCI [55], DMRA [54], D3Net [38], SSF [39], A2dele [40], S2MA [41], ICNet [42], JL-DCF [43], and UC-Net [44]. We report the mean values of S_α and MAE across five datasets (STERE [139], NLPR [51], LFSD [140], DES [49], and SIP [38]) for each model in Fig. 5 (red diamonds: deep models; green circles: traditional models). Better models appear in the upper left corner (i.e., with larger S_α and smaller MAE). From Fig. 5, we may make the following observations:
• Traditional versus deep learning models. Compared to traditional RGB-D based salient object detection models, deep learning methods obtain significantly better performance. This confirms the powerful feature learning ability of deep networks.
• Comparison of deep models. Among the deep learning-based models, D3Net [38], JL-DCF [43], UC-Net [44], SSF [39], ICNet [42], and S2MA [41] obtain the best performance.
Figures 6 and 7 show PR and F-measure curves for the 24 representative RGB-D based salient object detection models on eight datasets: STERE [139], NLPR [51], LFSD [140], DES [49], SIP [38], GIT [47], SSD [85], and NJUD [56]. Note that there are 1000, 300, 100, 135, 929, 80, and 80 test samples for STERE, NLPR, LFSD, DES, SIP, GIT, and SSD, respectively. For the NJUD [56] dataset, there are 485 test images for CPFP [53], S2MA [41], ICNet [42], JL-DCF [43], and UC-Net [44], and 498 test images for all other models. To understand the best six models in depth, we discuss their main advantages below. D3Net [38] consists of two key components, a three-stream feature learning module and a depth purifier unit. The three-stream feature learning module has three subnetworks: RgbNet, RgbdNet, and DepthNet. RgbNet and DepthNet are used to learn high-level feature representations for RGB and depth images, respectively, while RgbdNet is used to learn their fused representations. This three-stream feature learning module can capture modality-specific information as well as the correlation between modalities. Balancing the two aspects is very important for multi-modal learning and helps to improve salient object detection performance. The depth purifier unit acts as a gate to explicitly remove low-quality depth maps, whose effects other existing methods often do not consider. Because low-quality depth maps can hinder the fusion of RGB images and depth maps, the depth purifier unit can ensure effective multi-modal fusion to achieve robust salient object detection. JL-DCF [43] has two key components, for joint learning (JL) and densely-cooperative fusion (DCF). Specifically, the JL module is used to learn robust saliency features, while the DCF module is used for complementary feature discovery. This method uses a middle-fusion strategy to extract deep hierarchical features from RGB images and depth maps, in which cross-modal complementarity is effectively exploited to achieve accurate prediction. UC-Net [44], instead of producing a single saliency prediction, produces multiple predictions by modeling the distribution of the output space via a generative model conditioned on RGB-D images. Because each person has specific preferences in labeling a saliency map, the stochastic characteristics of saliency may not be captured when a single saliency map is produced for an image pair using a deterministic learning pipeline. The strategy in this model can take into account human uncertainty in saliency annotation. Moreover, depth maps can suffer from noise, and directly fusing RGB images and depth maps can cause the network to fit this noise. Therefore, a depth correction network, designed as an auxiliary component, is used to refine depth information with a semantically guided loss. All of these key components help to improve salient object detection performance. In SSF [39], a complementary interaction module (CIM) is developed to explore discriminative cross-modal complementarity and to fuse cross-modal features, where region-wise attention is introduced to supplement rich boundary information for each modality. A compensation-aware loss is used to improve the network's confidence for hard samples with unreliable depth information.
These key components enable the proposed model to effectively explore and establish the complementarity of cross-modal feature representations while reducing the negative effects of low-quality depth maps, boosting salient object detection performance. ICNet [42] uses an information conversion module to interactively and adaptively explore the correlations between high-level RGB and depth features. A cross-modal depth-weighted combination block is introduced to enhance the differences between the RGB and depth features at each level, ensuring that the features are treated differently. ICNet exploits the complementarity of cross-modal features, as well as exploring the continuity of cross-level features, both of which help to achieve accurate predictions. S2MA [41] uses a self-mutual attention module (SAM) to fuse RGB and depth images, integrating self-attention and mutual attention to propagate context more accurately. The SAM can provide additional complementary information from multi-modal data to improve salient object detection performance, overcoming the limitation of self-attention, which uses only a single modality. To reduce the effects of low-quality depth cues (due to, e.g., noise), a selection mechanism is used to reweight the mutual attention. This can filter out unreliable information, resulting in more accurate saliency prediction. To investigate the influence of different factors, such as object scale, background clutter, number of salient objects, indoor or outdoor scene, background objects, and lighting conditions, we carried out diverse attribute-based evaluations on several representative RGB-D based salient object detection models. Object scale. To characterize the scale of a salient object, we compute the ratio of the size of the salient area to that of the whole image. We define three object scales: small, when the ratio is less than 0.1; large, when the ratio is greater than 0.4; and medium, otherwise. For this evaluation, we built a hybrid dataset with 2464 images collected from STERE [139], NLPR [51], LFSD [140], DES [49], and SIP [38], in which 24%, 69.2%, and 6.8% of images have small, medium, and large salient objects, respectively. The constructed hybrid dataset can be found at https://github.com/taozh2017/RGBD-SODsurvey. Some sample images with objects of different scales are shown in Fig. 8. The results of the attribute-based comparison w.r.t. object scale are shown in Table 8. It can be observed that all methods perform best at detecting small salient objects and worst for large salient objects. The three most recent models, JL-DCF [43], UC-Net [44], and S2MA [41], achieve the best performance. D3Net [38], SSF [39], A2dele [40], and ICNet [42] also obtain promising performance. Background clutter. It is difficult to directly characterize background clutter. Since classic salient object detection methods tend to use prior information or color contrast to locate salient objects, they often fail in the presence of complex backgrounds. Thus, in this evaluation, we utilize five traditional salient object detection methods, BSCA [161], CLC [162], MDC [163], MIL [164], and WFD [165], to first detect salient objects in various images, and then categorize these images as having simple or complex backgrounds according to the results. Specifically, we first constructed a hybrid dataset with 1400 images collected from three datasets (STERE [139], NLPR [51], and LFSD [140]).
Then, we applied the five models to this dataset and obtained S_α values for each image, which we used to characterize the images as follows. If all S_α values are higher than 0.9, the image is considered to have a simple background. If all S_α values are lower than 0.6, the image is said to have a complex background. The remaining images are deemed uncertain. Some example images with these three types of background clutter are shown in Fig. 9. The constructed hybrid dataset can be found at https://github.com/taozh2017/RGBD-SODsurvey. The results of the attribute-based comparison w.r.t. background clutter are shown in Table 9. All models are worse at salient object detection for images with complex backgrounds than simple ones. Among the representative models, JL-DCF [43], UC-Net [44], and SSF [39] achieve the three best results. The four most recent models, D3Net [38], S2MA [41], A2dele [40], and ICNet [42], obtain better performance than the other models.
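The two labelling rules above are simple enough to state as code. The hypothetical sketch below reproduces them with NumPy: the scale thresholds (0.1, 0.4) and the S_α thresholds (0.9, 0.6) come from the text, while the function names are invented for illustration.

# Sketch of the two attribute-labelling rules described in the text.
import numpy as np

def object_scale(gt_mask):
    # gt_mask: binary H x W ground-truth mask; returns 'small', 'medium', or 'large'.
    ratio = gt_mask.astype(bool).mean()   # salient area / image area
    if ratio < 0.1:
        return "small"
    if ratio > 0.4:
        return "large"
    return "medium"

def background_clutter(s_alpha_scores):
    # s_alpha_scores: S-measure values of the five baseline detectors on one image.
    scores = np.asarray(s_alpha_scores)
    if (scores > 0.9).all():
        return "simple"
    if (scores < 0.6).all():
        return "complex"
    return "uncertain"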
Number of salient objects. For this evaluation, we constructed a hybrid dataset with 1229 images from the NLPR [51] and SIP [38] datasets. Some example images with single and multiple salient objects are shown in Fig. 10. The comparison results, in terms of MAE and S_α for the 24 representative models, are shown in Fig. 11. From the results, we can see that it is easier to detect a single salient object than multiple ones. Indoor vs. outdoor scenes. We evaluated the performance of different RGB-D based salient object detection models on indoor and outdoor scenes. For this evaluation, we constructed a hybrid dataset collected from the DES [49], NLPR [51], and LFSD [140] datasets. The results are shown in Fig. 12. It can be seen that most models struggle more to detect salient objects in indoor scenes than in outdoor scenes. This is possibly because indoor environments often have varying lighting conditions. Background objects. We evaluated the performance of RGB-D based salient object detection models in the presence of different background objects. We used the SIP dataset [38], and split it into eight categories: car, barrier, flower, grass, road, sign, tree, and other. The results of the comparison are shown in Table 10. All methods obtain diverse performances for different background objects. Among the 24 representative RGB-D based models, JL-DCF [43], UC-Net [44], and SSF [39] achieve the three best results. The four most recent models, i.e., D3Net [38], S2MA [41], A2dele [40], and ICNet [42], obtain better performance than the others. Lighting conditions. The performance of salient object detection methods can be affected by the lighting conditions. To determine the effects on different RGB-D based salient object detection models, we conducted an evaluation on the SIP dataset [38], whose images we split into two categories: sunny and low-light. The results of the comparison are shown in Table 11. Low light negatively impacts salient object detection performance. Among the models compared, UC-Net [44] obtained the best performance under sunny conditions, while JL-DCF [43] achieved the best result under low light. Visual comparison. We further report saliency maps generated for various challenging scenes to allow visualization of the performance of different RGB-D based salient object detection models. Figures 13 and 14 show some representative examples for two classic non-deep methods, DCMC [36] and SE [37], and eight state-of-the-art CNN-based models: DMRA [54], D3Net [38], SSF [39], A2dele [40], S2MA [41], ICNet [42], JL-DCF [43], and UC-Net [44]. Row 1 shows a small object, while row 2 shows a large object. Rows 3 and 4 contain complex backgrounds. In these visual comparisons, S2MA [41], JL-DCF [43], and UC-Net [44] perform better than the other deep models. Depth maps with detailed spatial information have proven beneficial in detecting salient objects against cluttered backgrounds, but depth quality directly affects salient object detection performance. The quality of depth maps varies tremendously across different scenarios due to the nature of depth sensors, posing a challenge when trying to reduce the effects of low-quality depth maps. However, most existing methods directly fuse RGB images and raw depth maps, without considering the effects of low-quality depth maps. There are a few notable exceptions. For example, in Ref. [53], a contrast-enhanced network was proposed to learn enhanced depth maps with much higher contrast than the original depths. In Ref. [39], a compensation-aware loss was designed to pay more attention to hard samples containing unreliable depth information. D3Net [38] uses a depth purifier unit to classify depth maps as reasonable or low-quality; it also acts as a gate to filter out low-quality depth maps. However, such methods often employ a two-step strategy to achieve depth enhancement and multi-modal fusion [39, 53], or an independent gate operation to remove poor depth maps, which can lead to suboptimal results. There is thus a need to develop an end-to-end framework that can achieve depth enhancement or adaptively assign low weights to poor depth maps during multi-modal fusion, which would be more helpful in reducing the effects of low-quality depth maps and boosting salient object detection performance.
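As a sketch of what such an end-to-end weighting could look like, the hypothetical PyTorch module below predicts a scalar quality score from the depth features and scales their contribution before fusion, so that unreliable depth maps are down-weighted within the network rather than removed in a separate step. The design and all names are illustrative assumptions, not a published method.

# Illustrative end-to-end depth-quality gate for multi-modal fusion.
import torch
import torch.nn as nn

class DepthQualityGate(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Predict a scalar quality score in (0, 1) from pooled depth features.
        self.score = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_rgb, f_depth):            # both: B x C x H x W
        q = self.score(f_depth.mean(dim=(2, 3)))  # B x 1, learned depth quality
        f_depth = f_depth * q.view(-1, 1, 1, 1)   # low-quality depth contributes less
        return self.fuse(torch.cat([f_rgb, f_depth], dim=1))

Because the gate is trained jointly with the detector, the quality estimate and the fusion are optimized together, avoiding the suboptimality of a separate filtering stage.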
In RGB-D datasets, it is inevitable that some depth maps will be of low quality due to the limitations of the acquisition devices. As previously discussed, several depth enhancement algorithms have been used to improve the quality of depth maps. However, depth maps that suffer from severe noise or blurred edges are often discarded. In this case, we have complete RGB images but some samples without depth maps, which is similar to the incomplete multi-view or multi-modal learning problem [166-170]. We may call this problem incomplete RGB-D based salient object detection. As current models only focus on salient object detection using complete RGB images and depth maps, we believe this could be a new direction for RGB-D salient object detection. Depth estimation provides an effective solution for recovering high-quality depth maps and overcoming the effects of low-quality ones. Various depth estimation approaches [171-174] have been developed, and these could be introduced into the RGB-D based salient object detection task to improve performance. It is important to effectively fuse RGB images and depth maps for RGB-D based salient object detection. Existing models often employ different fusion strategies (early fusion, middle fusion, or late fusion) to exploit the correlations between RGB images and depth maps. Recently, generative adversarial networks (GANs) [175] have gained widespread attention for the saliency detection task [176, 177]. In common GAN-based salient object detection models, a generator takes RGB images as inputs and generates the corresponding saliency maps, while a discriminator determines whether a given saliency map is synthetic or ground truth. GAN-based models could easily be extended to RGB-D salient object detection, which could help to boost performance due to their superior feature learning ability. Moreover, GANs could also be used to learn common feature representations for RGB images and depth maps [114], which could help with feature or saliency map fusion and further boost salient object detection performance. Attention mechanisms have been widely applied to various deep learning-based tasks [178-181], allowing networks to selectively pay attention to a subset of regions when extracting powerful and discriminative features. Co-attention mechanisms have also been developed to explore the underlying correlations between multiple modalities; they are widely studied in visual question answering [182, 183] and video object segmentation [184]. Thus, for the RGB-D based salient object detection task, we could also develop attention-based fusion algorithms to exploit the correlations between RGB images and depth cues to improve performance. Existing RGB-D models often use a fully supervised strategy to learn saliency prediction models. However, annotating pixel-level saliency maps is a tedious and time-consuming procedure. To alleviate this issue, there has been increasing interest in weakly and semi-supervised learning, which has been applied to salient object detection [185-189]. Semi- and weak supervision could also be introduced into RGB-D salient object detection, by leveraging image-level tags [185] and pseudo pixel-wise annotations [188, 190], to improve detection performance. Furthermore, several studies [191, 192] have suggested that models pretrained using self-supervision can effectively be used to achieve better performance. Therefore, we could pretrain saliency prediction models on large amounts of RGB images in a self-supervised manner and then transfer the pretrained models to the RGB-D salient object detection task. Although there are nine public RGB-D datasets for salient object detection, their size is quite limited, with the largest, NJUD [56], containing about 2000 samples.
When compared to other RGB-D datasets for generic object detection or action recognition [193, 194], the RGB-D datasets for salient object detection are very small. Thus, it is essential to develop new large-scale RGB-D datasets to serve as baselines for future research. Most existing RGB-D datasets contain images with one salient object, or multiple objects but against a relatively clean background. However, real-world applications often involve much more complicated situations, e.g., occlusion, appearance change, and low illumination, which can reduce salient object detection performance. Thus, collecting images with complex backgrounds is critical to improving the generalizability of RGB-D salient object detection models. Moreover, for some tasks, images with specific salient object(s) must be collected. For example, road sign recognition is important in driver assistance systems, requiring images with road signs to be collected. Thus, it is essential to construct task-driven RGB-D datasets like SIP [38]. Some smartphones can capture depth maps (e.g., images in the SIP dataset were captured using a Huawei Mate10). Thus, it is feasible to perform salient object detection for real-world applications on smart devices. However, most existing methods rely on complicated, deep DNNs to increase model capacity and improve performance, preventing them from being directly applied to such platforms. To overcome this, model compression techniques [195, 196] could be used to learn compact RGB-D based salient object detection models with promising detection accuracy. Moreover, JL-DCF [43] utilizes a shared network to locate salient objects using RGB and depth views, which largely reduces the number of model parameters and makes real-world applications more feasible. In addition to RGB-D salient object detection, there are several other methods that fuse different modalities for better detection, such as RGB-T salient object detection, which integrates RGB and thermal infrared data. Thermal infrared cameras can capture the heat radiation emitted from any object, making thermal infrared images insensitive to illumination conditions [197]. Therefore, thermal images can provide supplementary information to improve salient object detection when images of salient objects suffer from varying light, glare, or shadows. Some RGB-T models [197-205] and datasets (VT821 [199], VT1000 [203], and VT5000 [205]) have already been proposed over the past few years. As with RGB-D salient object detection, the key aim of RGB-T salient object detection is to fuse RGB and thermal infrared images and exploit the correlations between the two modalities. Thus, several advanced multi-modal fusion technologies from RGB-D salient object detection could be extended to the RGB-T salient object detection task. This paper has presented the first comprehensive review of RGB-D based salient object detection models. We have reviewed the models from different perspectives, and summarized popular RGB-D salient object detection datasets, providing details of each. As light fields also provide depth information, we have also reviewed popular light field salient object detection models and related benchmark datasets. We have comprehensively evaluated 24 representative RGB-D based salient object detection models, as well as performing an attribute-based evaluation on newly constructed datasets. Moreover, we have discussed several challenges and highlighted open directions for future research.
In addition, we have briefly discussed the extension to RGB-T salient object detection to improve robustness to lighting conditions. Although RGB-D based salient object detection has made notable progress over the past several years, there is still significant room for improvement. We hope this survey will generate more interest in this field.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
Salient objects in clutter: Bringing salient object detection to the foreground
Multi-level context ultra-aggregation for stereo matching
Unsupervised object class discovery via saliency-guided multiple class learning
Re-thinking co-salient object detection
Dense saliency-based spatiotemporal feature points for action recognition
Shifting more attention to video salient object detection
Saliency-aware video object segmentation
Pyramid dilated deeper ConvLSTM for video salient object detection
Video salient object detection via fully convolutional networks
Distinct class-specific saliency maps for weakly supervised semantic segmentation
Joint learning of saliency detection and weakly supervised semantic segmentation
PraNet: Parallel reverse attention network for polyp segmentation
Inf-Net: Automatic COVID-19 lung infection segmentation from CT images
An explainable COVID-19 diagnosis system by joint classification and segmentation
Saliency-based discriminant tracking
Online tracking by learning discriminative saliency map with convolutional neural network
Person re-identification by saliency learning
Kernelized saliency-based person re-identification through multiple metric learning
Camouflaged object detection
A model of visual attention for natural image retrieval
Edge guidance network for salient object detection
Real-time salient object detection with a minimum spanning tree
What is and what is not a salient object?
Learning salient object detector by ensembling linear exemplar regressors
Saliency detection: A spectral residual approach
Hierarchical saliency detection
Saliency detection via graph-based manifold ranking
Deep contrast learning for salient object detection
Co-saliency detection via a self-paced multiple-instance learning framework
Aggregating multi-level convolutional features for salient object detection
Learning uncertain convolutional features for accurate saliency detection
A stagewise refinement model for detecting salient objects in images
Contour knowledge transfer for salient object detection
Salient object detection with pyramid attention and salient edges
Selectivity or invariance: Boundary-aware salient object detection
Pyramid feature attention network for saliency detection
Saliency detection for stereoscopic images based on depth confidence analysis and multiple cues fusion
Salient object detection for RGB-D image via saliency evolution
Rethinking RGB-D salient object detection: Models, data sets, and large-scale benchmarks
Select, supplement and focus for RGB-D saliency detection
Adaptive and attentive depth distiller for efficient RGB-D salient object detection
Learning selective self-mutual attention for RGB-D saliency detection
ICNet: Information conversion network for RGB-D based salient object detection
Joint learning and densely-cooperative fusion framework for RGB-D salient object detection
UC-Net: Uncertainty inspired RGB-D saliency detection via conditional variational autoencoders
CNN-based RGB-D salient object detection: Learn, select and fuse
Depth matters: Influence of depth cues on visual saliency
An in depth view of saliency
Depth really matters: Improving visual salient region detection with depth
Depth enhanced saliency detection method
Exploiting global priors for RGB-D saliency detection
RGBD salient object detection: A benchmark and algorithms
RGBD salient object detection via deep fusion
Contrast prior and fluid pyramid integration for RGBD salient object detection
Depth-induced multi-scale recurrent attention network for saliency detection
Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection
Depth saliency based on anisotropic center-surround difference
Local background enclosure for RGB-D salient object detection
CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion
Salient object detection: A benchmark
Review of visual saliency detection with comprehensive information
A review of co-saliency detection algorithms: Fundamentals, applications, and challenges
Advanced deep-learning techniques for salient and category-specific object detection: A survey
Attentive systems: A survey
Salient object detection: A survey
Object detection with deep learning: A review
Salient object detection in the deep learning era: An in-depth survey
Depth combined saliency detection based on region contrast model
Evaluation and modeling of depth feature incorporated visual attention for salient object segmentation
Depth incorporating with color improves salient object detection
Salient regions detection for indoor robots using RGB-D data
RGB-D saliency detection via mutual guided manifold ranking
Selective features for RGB-D saliency
Improving RGBD saliency detection using progressive region classification and saliency fusion
RGBD saliency detection under Bayesian framework
Saliency analysis based on depth contrast increased
Depth-aware saliency detection using discriminative saliency fusion
Visual saliency detection for RGB-D images with generative model
Histogram of surface orientation for RGB-D salient object detection
M3Net: Multi-scale multi-path multi-modal fusion network and example application to RGB-D salient object detection
RGB-D saliency detection by multi-stream late fusion network
Learning RGB-D salient object detection using background enclosure, depth contrast, and top-down features
An innovative salient object detection using center-dark channel prior
A three-pathway psychobiological framework of salient object detection using stereoscopic technology
RGB-D salient object detection via minimum barrier distance transform and saliency fusion
Depth-aware salient object detection and segmentation via multiscale discriminative saliency fusion and bootstrap learning
An iterative co-saliency framework for RGBD images
An integration of bottom-up and top-down salient cues on RGB-D data: Saliency from objectness versus non-objectness
HSCS: Hierarchical sparsity based co-saliency detection for RGBD images
Co-saliency detection for RGBD images based on multi-constraint feature matching and cross label propagation
Progressively complementarity-aware fusion network for RGB-D salient object detection
RGBD salient object detection using spatially coherent deep learning framework
Attention-aware cross-modal cross-level fusion network for RGB-D salient object detection
Stereoscopic saliency model using contrast and depth-guided-background prior
Salient object detection for RGB-D image by single stream recurrent convolution neural network
RGB-D salient object detection by a CNN with multiple layers fusion
Two-stream refinement network for RGB-D saliency detection
Salient object segmentation based on depth-aware image layering
Global and local-contrast guides content-aware fusion for RGB-D saliency prediction
Learning-based saliency model with depth information
Prior-model guided depth-enhanced network for salient object detection
Three-stream attention-aware network for RGB-D salient object detection
Discriminative cross-modal transfer learning and densely cross-level feedback fusion for RGB-D salient object detection
Going from RGB to RGBD saliency: A depth-guided transformation model
Adaptive fusion for RGB-D salient object detection
Co-saliency detection for RGBD images based on effective propagation mechanism
Depth-aware saliency detection using convolutional neural networks
Depth potentiality-aware gated attention network for RGB-D salient object detection
Synergistic saliency and depth prediction for RGB-D saliency detection
Attention-guided RGBD saliency detection using appearance information
A cross-modal adaptive gated fusion generative adversarial network for RGB-D salient object detection
CoCNN: RGB-D deep fusion for stereoscopic salient object detection
cmSalGAN: RGB-D salient object detection with cross-view generative adversarial networks
Multi-modal weights sharing and hierarchical feature fusion for RGBD salient object detection
Bilateral attention network for RGB-D salient object detection
ASIF-Net: Attention steered interweave fusion network for RGB-D salient object detection
Triple-complementary network for RGB-D salient object detection
Improved saliency detection in RGB-D images using two-phase depth estimation and selective deep fusion
Gate fusion network with Res2Net for detecting salient objects in RGB-D images
Salient object detection for RGB-D images by generative adversarial network
Cross-modal weighting network for RGB-D salient object detection
Hierarchical dynamic filtering network for RGB-D salient object detection
Cascade graph neural networks for RGB-D salient object detection
RGB-D salient object detection with cross-modality modulation and selection
A single stream network for robust and real-time RGB-D salient object detection
Accurate RGB-D salient object detection via collaborative learning
BBS-Net: RGB-D salient object detection with a bifurcated backbone strategy network
Asymmetric two-stream architecture for accurate RGB-D saliency detection
Progressively guided alternate refinement network for RGB-D salient object detection
Multi-level cross-modal interaction network for RGB-D salient object detection
Data-level recombination and lightweight fusion scheme for RGB-D salient object detection
Knowing depth quality in advance: A depth quality assessment method for RGB-D salient object detection
Depth quality aware salient object detection
Is depth really necessary for salient object detection?
RGBD salient object detection via disentangled cross-modal fusion
Saliency detection via depth-induced cellular automata on light field
Feature reintegration over differential treatment: A top-down and adaptive fusion network for RGB-D salient object detection
Leveraging stereopsis for saliency analysis
Saliency detection on light field
Light field fusion network for salient object detection
A weighted sparse coding framework for saliency detection
Saliency detection with a deeper investigation of light field
Relative location for light field saliency detection
Saliency detection on light field
A two-stage Bayesian integration framework for salient object detection on light field
Saliency detection on light field
Saliency detection with relative location measure in light field image
Salience guided depth calibration for perceptually optimized compressive light field 3D display
Depth-induced cellular automata for light field saliency
Deep learning for light field saliency detection
Deep light-field-driven saliency detection from a single view
Memory-oriented decoder for light field salient object detection
Exploit and replace: An asymmetrical two-stream architecture for versatile light field saliency detection
Region-based depth feature descriptor for saliency detection on light field
Light field saliency detection with deep convolutional networks
Frequency-tuned salient region detection
Saliency filters: Contrast based filtering for salient region detection
Structure-measure: A new way to evaluate foreground maps
Enhanced-alignment measure for binary foreground map evaluation
Saliency detection via cellular automata
Salient region detection via integrating diffusion-based compactness and local contrast
300-FPS salient object detection via minimum directional contrast
Salient object detection via multiple instance learning
Water flow driven salient object detection at 180 fps
Multi-view learning with incomplete views
Effective feature learning and fusion of multi-modality data using stage-wise deep neural network for dementia diagnosis
Latent representation learning for Alzheimer's disease diagnosis with incomplete multi-modality neuroimaging and genetic data
Multi-modal latent space inducing ensemble SVM classifier for early dementia diagnosis with neuroimaging data
Hi-Net: Hybrid-fusion network for multi-modal MR image synthesis
Unsupervised monocular depth estimation with left-right consistency
Deep convolutional neural fields for depth estimation from a single image
This research was supported by a Major Project for a New Generation of AI under Grant No. 2018AAA0100400, by the National Natural Science Foundation of China (61922046), and by the Tianjin Natural Science Foundation (17JCJQJC43700).