Abstract
Single image super-resolution (SISR) consists of obtaining a high-resolution version of a low-resolution image by increasing the number of pixels per unit area. The problem has been actively investigated by the research community due to its wide range of applications, from real-world surveillance to aerial and satellite imaging. Most improvements in SISR come from convolutional networks, where approaches often follow the deeper-and-wider architectural paradigm. In this work, we step away from traditional convolutions and adopt the concept of capsules. Given their overwhelming results in image classification and segmentation problems, we question how suitable they are for SISR. We also verify that different solutions share similar configurations, and argue that this trend leads to fewer explorations of network designs. Throughout our experiments, we evaluate various strategies to improve results, ranging from new and different loss functions to changes in the capsule layers. Our network achieves positive and promising results with fewer convolutional-based layers, showing that capsules might be a concept worth applying to the image super-resolution problem. In particular, we observe that the proposed method recreates the connection between different characters more precisely, demonstrating the potential of capsules in super-resolution problems.
Artur Jordáo: This work was done when Artur Jordao was a post-doctoral researcher at the University of Campinas.
1 Introduction
State-of-the-art methods in image super-resolution are based on artificial intelligence concepts, more specifically on deep neural networks, and have achieved visually striking results [6, 22, 40, 61, 63]. Most recent models are composed of traditional convolutional layers that, although widely studied and optimized, exhibit limitations. For example, they limit the network's understanding of the data, failing to absorb valuable information such as the interrelationship between its components.
As an alternative to convolutional networks, a large body of work has demonstrated the abstraction power of the capsules [14, 42, 60]. For example, Sabour et al. [41] presented an implementation of the capsule concept, a group of neurons whose activation vector represents the parameters that describe a specific entity, such as an object or part of it, and its presence. Using a three-layer capsule network, Sabour et al. achieved results comparable to those of deeper convolutional neural networks for the digit identification problem. The authors also obtained a 5% error rate on the segmentation of two different digits with an 80% overlap – a rate previously achieved only in much simpler cases (with less than 4% overlap between images). Hinton et al. [15] reduced by 45% the error rate on the problem of identifying objects from different perspectives while making the network more resistant to adversarial attacks. LaLonde and Bagci [28] have created a network able to process images tens of times larger while reducing the number of parameters by 38.4% and increasing the quality of medical image segmentation.
A few works proposed the usage of capsule-based networks to solve problems that involve low-resolution images. Singh et al. [46] proposed a Dual Directed Capsule Network, named DirectCapsNet, which employs a combination of capsules and convolutional layers for addressing very low resolution (VLR) digit and face recognition problems. Majdabadi and Ko [35] implemented a Generative Adversarial Network (GAN) that uses a CapsNet as the discriminator for facial image super-resolution, surpassing strong baselines in all metrics. Hsu et al. [18] developed two frameworks to incorporate capsules into image SR convolutional networks: Capsule Image Restoration Neural Network (CIRNN) and Capsule Attention and Reconstruction Neural Network (CARNN). The results outperformed traditional CNN methods with a similar number of parameters. Despite the positive results, most of these works rely on plain CapsNet [41], failing to explore novel capsule concepts. Our method bridges this gap.
This work focuses on the single image super-resolution problem using capsules. We implement a model based on newer concepts of capsules for single image super-resolution (SISR) problems. Throughout our evaluation, we used publicly available and standard benchmarks such as Set5 [7], Set14 [59], B100 [36], Urban100 [19] and DIV2K [2]. Our model yields fine-grained super-resolution images and achieves positive results with fewer convolutional-based layers than baselines.
2 Background
Super-Resolution. Super-resolution (SR) is the process of obtaining one or more plausible high-resolution (HR) images from one or more low-resolution (LR) images [4, 37, 56]. It is an area that has been studied for decades [21] and has a wide variety of application fields, such as smart surveillance, aerial imaging, medical image processing, and traffic sign reading [4, 37, 44, 56]. The relationship between LR and HR images may vary depending on the situation. Many studies assume that the LR image is a version of the HR image reduced by bicubic interpolation. However, other degradation factors can appear in real examples, such as quantization errors, acquisition sensor limitations, presence of noise, blurring, and even the use of different interpolation operators aiming at resolution reduction for storage [50].
The first successful usage of neural networks for SISR problems was developed by Dong et al. [11]. With the Super-Resolution Convolutional Neural Network (SRCNN) model, they created a complete solution that maps LR images to SR versions with little pre/post-processing, yielding superior results. After their achievement, several other works advanced the state-of-the-art in the SISR problem [10, 24, 25, 54], but with strong limitations. These models receive as input an enlarged version of the LR image, usually produced by bicubic interpolation, and seek to improve its quality. This means that the operations performed by the neural network are all done in high-resolution space, which is inefficient and incurs a high processing cost: the computational complexity of convolution grows quadratically with the size of the input image, so generating an SR image with a scaling factor of n costs \(n^{2}\) times more than processing in the low-resolution space [12].
Looking for a way to postpone the resolution increase within the network, Shi et al. [44] developed a new layer called subpixel convolution (or PixelShuffle). Such a layer is equivalent to a deconvolution whose kernel size is divisible by the stride, but it is \(\log _{2}r^{2}\) times faster. Their network, named Efficient Sub-pixel Convolutional Neural Network (ESPCN), achieved speed improvements of over \(10\times \) compared to SRCNN [11], while achieving better results for an upscaling factor of \(4\times \) despite having a higher number of parameters.
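The rearrangement performed by subpixel convolution can be sketched in a few lines. The following NumPy snippet (an illustration, not the ESPCN implementation) shows how \(r^{2}\) low-resolution feature maps are interleaved into one map upscaled by a factor of r:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) array into (C, H*r, W*r),
    mimicking what a subpixel convolution (PixelShuffle) layer does."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    # split the channel axis into (C, r, r), then interleave spatially
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# four 2x2 low-resolution maps become a single 4x4 map for r = 2
lr_maps = np.arange(16).reshape(4, 2, 2)
sr = pixel_shuffle(lr_maps, 2)
print(sr.shape)  # (1, 4, 4)
```

Each output pixel at position \((hr+i, wr+j)\) comes from channel \(ir+j\) of the low-resolution maps, so the upscaling is a pure rearrangement with no extra arithmetic.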
The concept of subpixel convolution is currently a common choice to perform upscaling in neural networks. It has been used by several solutions that have reached the best results [6, 30, 33, 40, 53, 61, 63, 64] and participated in several editions of the SISR competition that took place during the New Trends in Image Restoration and Enhancement workshop [8, 50, 51, 62].
Capsules. Initially introduced by Hinton et al. [16], the concept of capsule proposes to solve some of the main flaws of traditional convolutional networks: the inability to identify spatial hierarchies between elements and the lack of rotation invariance. Hinton et al. conclude that, after several stages of subsampling, these networks lose the information that makes it possible to identify the spatial relationships between the elements of an image. The authors argue that, rather than seeking viewpoint invariance in the activities of neurons with a single output value, neural networks should use local “capsules” that learn to recognize a visual entity implicitly under a limited domain of viewing conditions and deformations. Capsule structures encode complex calculations into a small, highly informative output vector. This vector contains information such as the probability of that entity being present in a compact domain. In addition, it comprises a set of instantiation parameters, which may include the deformation, pose (position, size, orientation), hue, texture, and illumination condition of the visual entity relative to the version learned by the capsule [41].
Although idealized by Hinton et al., the first successful implementation of the capsule concept was made by Sabour et al. [41]. In their work, the authors created a three-layer capsule network that achieved results comparable to the best on the MNIST [29] digit classification problem – previously reached by deeper networks only. To do so, Sabour et al. [41] developed two innovative concepts: dynamic routing and a new activation function.
Building on Sabour et al. [41], many other authors have enhanced the concept of capsules. Hinton et al. [15] proposed a new type of capsule composed of a logistic unit that indicates the probability of the presence of an entity and a \(4 \times 4\) pose matrix representing the pose of that entity. The authors also introduced a new routing algorithm, which routes the outputs of the capsules to those of the next layer so that the active capsules receive a group of votes from similar poses. Hinton et al. showed that their model surpasses the best result on the smallNORB dataset, reducing the number of errors by more than 40% while being significantly more resistant to white-box adversarial attacks.
A remarkable work, which made possible the development of our solution, was developed by LaLonde and Bagci [28]. The authors expanded the use of capsules for the problem of object segmentation and made innovations that allowed, among other gains, to increase the data processing capacity of the capsule network, increasing inputs from \(32 \times 32\) to \(512 \times 512\) pixels. Most significantly, they advanced the state-of-the-art in the problem of segmentation of lung pathologies from computed tomography, while reducing the number of parameters by approximately 38%. In particular, the authors modified the capsule routing algorithm and the reconstruction part, and modified the concept of convolutional capsules.
Recently, a few authors have employed capsules in their solutions to problems involving LR images [18, 35, 46]. It is worth noting that most of these solutions make only small changes to the first capsule networks introduced by Sabour et al. [41] and Hinton et al. [15]. For example, Majdabadi and Ko [35] employed a two-layer capsule network with dynamic routing as the discriminator of their Multi-Scale Gradient capsule GAN. Building on the matrix capsules of Hinton et al. [15], Hsu et al. [18] created two different approaches: (i) capsules as the main component of the network, reconstructing HR images directly from them (CIRNN), and (ii) capsules in the channel attention mechanism (CARNN).
3 Proposed Method
The proposed model, named Super-Resolution Capsules (SRCaps), is shown in Fig. 1. It consists of four main parts: an initial convolutional layer, followed by B sequentially connected residual dense capsule blocks, a new convolutional layer and, finally, a neural network to increase resolution. All the convolution-based layers use the weight normalization technique [43], as it accelerates training convergence and has a lower computational cost than batch normalization, without introducing dependencies between the examples of the batch [61].
The first convolutional layer, CONV ACT in Fig. 1, generates F filters from convolutional kernels of size \(k \times k\) with stride st and padding p, followed by an activation function act. This layer is responsible for converting pixel intensities to local resource detector activations that are used as inputs to the next step in the capsule blocks.
The residual dense capsule blocks, RDCBs, are composed of L sequentially connected convolutional capsule layers, each followed by an activation function act and with a residual connection to its input. The outputs of these layers are concatenated, forming a dense connection, followed by a convolutional layer, as shown in Fig. 2. This convolutional layer, with \(1 \times 1\) kernels, stride 1, and padding 0, acts as a weighted sum over the various filters, allowing the network to learn which filters are more important and thus reducing dimensionality more efficiently [34, 45, 49]. The output of the RDCB is weighted by a residual scale constant.
All capsule layers within an RDCB module share the same parameters: the number of capsules per layer c, the number of filters F, kernel size k, stride st, and padding p.
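To make the wiring concrete, the sketch below reproduces the RDCB topology in PyTorch, standing in plain `Conv2d` layers for the convolutional capsule layers. This is a simplification for illustration only: the capsule layers themselves, and the values of `F`, `L`, and the residual scale used here, are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RDCB(nn.Module):
    """Sketch of a residual dense capsule block's wiring.
    Plain Conv2d layers stand in for the convolutional capsules."""
    def __init__(self, F=32, L=3, k=3, res_scale=0.2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Conv2d(F, F, k, padding=k // 2), nn.ReLU())
            for _ in range(L)
        )
        # 1x1 convolution fuses the concatenated outputs back to F maps
        self.fuse = nn.Conv2d(F * L, F, kernel_size=1)
        self.res_scale = res_scale

    def forward(self, x):
        feats, h = [], x
        for layer in self.layers:
            h = layer(h) + h  # residual connection to the layer input
            feats.append(h)
        out = self.fuse(torch.cat(feats, dim=1))  # dense connection
        return x + self.res_scale * out           # scaled block residual

x = torch.randn(1, 32, 8, 8)
print(RDCB()(x).shape)  # torch.Size([1, 32, 8, 8])
```

The block preserves the spatial dimensions and channel count, so several RDCBs can be chained directly, as in Fig. 1.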
Our capsules employ the routing algorithm suggested by LaLonde and Bagci [28], because it provides an efficient version of the original capsule definition, as mentioned before. This algorithm differs from the routing-by-agreement implementation of Sabour et al. [41] in two ways. First, we route capsules from the previous layer to capsules in the next layer only within a specific spatial window. The original algorithm, on the other hand, directs the output of all previous-layer capsules to all capsules in the next layer, varying only the routing weight. Second, we share the transformation matrices among all capsules of the same type. In a later step, we decided to replace the initial routing algorithm with the no-routing approach introduced by Gu et al. [13]. The authors argue that the routing procedure contributes neither to the generalization ability nor to the affine robustness of CapsNets; therefore, distinct ways of approximating the coupling coefficients make no significant difference, since they will be learned implicitly. In the no-routing approach, the iterative routing procedure is removed by setting all coupling coefficients to a constant \(\frac{1}{M}\), where M is the number of capsules in the next layer. We also tried different values for the squashing constant sq used in the squashing function, as done by Huang and Zhou [20].
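A minimal NumPy sketch of the squashing function with a configurable constant `sq`, combined with uniform (no-routing) coupling; the vote dimensions and values below are illustrative only:

```python
import numpy as np

def squash(s, sq=1.0, eps=1e-8):
    """Capsule squashing non-linearity with configurable constant sq
    (sq = 1 recovers the function of Sabour et al.)."""
    norm2 = np.sum(s * s, axis=-1, keepdims=True)
    return (norm2 / (sq + norm2)) * s / np.sqrt(norm2 + eps)

# no-routing: every coupling coefficient is fixed to 1/M, where M is the
# number of capsules in the next layer (shapes here are illustrative)
N, M, dim = 6, 4, 8
votes = np.random.randn(N, dim)  # votes for one next-layer capsule
s_j = votes.sum(axis=0) / M      # uniform coupling, no routing iterations
v_j = squash(s_j)
print(np.linalg.norm(v_j) < 1.0)  # True: a squashed vector has length < 1
```

Because the coefficients are constant, no iterative agreement step runs at inference time; the network compensates by learning the transformation matrices directly.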
The RDCBs are sequentially connected, each having a residual connection with the block input, followed by a new convolutional layer, and the output of that layer has a residual connection with the output of the first convolutional layer. We use residual connections, identified by the addition symbol in Figs. 1 and 2, for the following purposes: they avoid the vanishing-gradient problem by introducing shorter paths, which can carry the gradient over the entire length of very deep networks, as demonstrated by Veit et al. [52]; and they appear to greatly improve training speed [48].
At the end of the model, there is a network used for upscaling, called the up-sampling network (UPNet) (see Fig. 3). Following previous works [33, 64] and several participants of the NTIRE 2017 [50] and NTIRE 2018 [51] competitions, the UPNet is composed of subpixel convolutions [44]. The UPNet allows the network to implicitly learn the process required to generate the larger version, combining the LR-space feature maps and creating the SR image in a single step, saving memory and processing. We prefer this method over deconvolution since it naturally avoids checkerboard artifacts; with deconvolution, avoiding the overlapping problem requires a kernel size divisible by the stride, as demonstrated by Odena et al. [38]. Besides, UPNet has a considerably lower computational cost, being \(\log _{2}r^{2}\) times faster during training [44].
Loss Functions. During training, we evaluated loss functions commonly used in super-resolution problems [56]. Due to its simplicity and effectiveness, the first loss function we assess is L1. Previous studies [33, 65] showed that a network trained with L1 achieves superior results compared to the same network trained with L2.
Also in the work of Zhao et al. [65], the idea of using indices based on the Structural Similarity Index (SSIM) [57] for training neural networks is introduced. As previously noted by Dong et al. [11], if a metric based on visual perception is used during training, the network can adapt to it. The SSIM is calculated as:
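In the standard formulation of Wang et al. [57], consistent with the terms defined below:

```latex
\mathrm{SSIM}(p) = \underbrace{\frac{2\mu_{x}\mu_{y} + C_{1}}{\mu_{x}^{2} + \mu_{y}^{2} + C_{1}}}_{l(p)} \cdot \underbrace{\frac{2\sigma_{xy} + C_{2}}{\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2}}}_{cs(p)}
```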
where \(\mu _{x}\) and \(\mu _{y}\) are the average pixel values in the SR and HR patches, respectively, \(\sigma _{x}\) and \(\sigma _{y}\) are the standard deviations of the same patches, \(\sigma _{x y}\) is the covariance between them, and \(C_{1}\) and \(C_{2}\) are constants added to avoid instabilities when the values of \(\mu _{x}^{2}+\mu _{y}^{2}\) and \(\sigma _{x}^{2}+\sigma _{y}^{2}\) are very close to 0. The l(p) term compares the luminances of the patches, while cs(p) compares their contrasts and structures. Since the highest possible value for SSIM is 1, and training a neural network usually aims to minimize the loss function, we can define the \(\mathcal {L}^{\text {SSIM}}\) function as:
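Following Zhao et al. [65]:

```latex
\mathcal{L}^{\text{SSIM}}(p) = 1 - \mathrm{SSIM}(\tilde{p})
```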
in which \(\tilde{p}\) is the central pixel of patch p.
The best performance in the work of Zhao et al. [65] was obtained by combining L1 and the Multi-Scale Structural Similarity Index (MS-SSIM) shown in Eq. 4, with weights 0.16 and 0.84, respectively. The authors argue that MS-SSIM preserves contrast in high-frequency regions, while L1 preserves color and brightness regardless of the local structure. The MS-SSIM value is obtained by combining measurements at different scales using the equation:
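In the form of Wang et al. [55], consistent with the terms defined below:

```latex
\text{MS-SSIM}(p) = l_{M}^{\alpha}(p) \cdot \prod_{j=1}^{M} cs_{j}^{\beta_{j}}(p)
```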
where the scales range from 1 (the original image) to M (the largest scale used), the image being reduced by a factor of 2 at each iteration; \(l_{M}\) and \(cs_{j}\) are the terms defined in Eq. 3, computed at the M and j scales, respectively, while the \(\alpha \) and \(\beta _{j}\) exponents adjust the relative significance of the different components. It is worth noting that the luminance comparison (\(l_{M}\)) is calculated only at scale M, while the contrast and structure comparisons (\(cs_{j}\)) are computed at every scale. As with SSIM, the largest possible value for MS-SSIM is 1, so we can use it in a loss function of the form:
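Analogously to the SSIM case:

```latex
\mathcal{L}^{\text{MS-SSIM}}(p) = 1 - \text{MS-SSIM}(\tilde{p})
```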
We also explored combining functions computed at several different layers of the network, as suggested by Xu et al. [58], in which a weighted sum of the L1 function computed after two, three, and four residual blocks is used, with weights of 0.5, 0.5, and 1, respectively. After each residual block composing the loss, we add a network based on subpixel convolutions to perform upscaling. Besides the above settings, we investigate the benefits of edge maps in the loss function, since L1 may smooth the edges. Similarly to Pandey et al. [39], we evaluated a combination of the L1 between the SR and HR images and the L1 between their edge maps. While their work uses the Canny operator [9] to generate the edge map, we investigated the Sobel operator [47].
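A toy NumPy version of such an edge-aware loss; the Sobel kernels are the standard ones, but the edge weight `w_edge` is an illustrative choice rather than a value from our setup:

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def sobel_edges(img):
    """Gradient magnitude of a 2-D image via the Sobel operator
    ('valid' convolution, so the output loses a 1-pixel border)."""
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * SOBEL_X).sum()
            gy[i, j] = (patch * SOBEL_Y).sum()
    return np.hypot(gx, gy)

def edge_aware_l1(sr, hr, w_edge=0.5):
    """L1 on pixel values plus a weighted L1 on Sobel edge maps
    (w_edge is a hypothetical weight, not taken from the paper)."""
    pixel = np.abs(sr - hr).mean()
    edge = np.abs(sobel_edges(sr) - sobel_edges(hr)).mean()
    return pixel + w_edge * edge
```

For identical images both terms vanish, so the loss is zero; the edge term only activates where gradient structure differs between SR and HR.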
We also consider the three-component weighted PSNR (3-PSNR) and SSIM (3-SSIM) loss functions [31]. Importantly, such metrics can measure the quality of images and videos. This approach breaks up an image into three parts: edges, textures, and more homogeneous regions. To do this, the Sobel operator is applied to the luminance channel of the image and, from the highest calculated value and some pre-established values, the thresholds that delimit each region are calculated.
The value of each metric is calculated by applying different weights to each region. Li and Bovik [31] showed that the weights that achieved the best results were 0.7, 0.15, and 0.15 for 3-PSNR, and 1, 0, and 0 for 3-SSIM, considering edges, textures, and homogeneous regions, respectively. These values are consistent with the observation that perturbations at the edges of an object are perceptually more significant than in other areas. Drawing on recent solutions from the literature, Barron [5] presented a loss function that is a superset of the Cauchy/Lorentzian, Geman-McClure, Welsch/Leclerc, generalized Charbonnier, Charbonnier/pseudo-Huber/L1-L2, and L2 functions. This function has two hyperparameters, robustness (\(\alpha \)) and scale (c), by varying which it is possible to recover all the previous functions as special cases. The general loss function is calculated as follows:
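In Barron's notation [5]:

```latex
\rho(x, \alpha, c) = \frac{|\alpha - 2|}{\alpha}\left[\left(\frac{(x/c)^{2}}{|\alpha - 2|} + 1\right)^{\alpha/2} - 1\right]
```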
where x is the difference between the HR and SR pixel values. Barron also showed that it is possible to modify the function so that the network learns optimal values for the \(\alpha \) and c parameters, thus allowing the network to explore different error functions during training. Due to these unique features, we use the adaptive loss function for the SRCaps training.
4 Experimental Results
Datasets. In this work, we employed datasets widely used in the literature [4, 56]. The DIV2K training set is currently used for training neural networks for the super-resolution problem, while the DIV2K validation set, B100, Set5, Set14, and Urban100 are used to validate the results. All datasets are composed of the original (HR) images and their versions reduced by a bicubic interpolation algorithm for \(2\times \), \(3\times \), and \(4\times \) factors. In this work, we focus on the \(4\times \) scale factor.
Metrics. For the validation process of the results obtained, we employed metrics commonly used in the literature. More specifically, Peak Signal-to-Noise Ratio (PSNR) [17], Structural Similarity Index (SSIM) [57] and Multi-Scale Structural Similarity Index (MS-SSIM) [55]. Due to space limitations, we refer interested readers to the work developed by Wang et al. [56] for a detailed formulation.
Algorithms that measure the differences between two images often assume that the images are shown side by side, or alternated with an empty image displayed in between for a short period before the next image is shown. In contrast, flipping (or alternating) between two similar images reveals their differences to an observer much more effectively than showing the images next to each other. Aiming to better approximate the methods of human evaluators, Andersson et al. [3] developed a full-reference image difference algorithm, named FLIP, which carefully evaluates differences inspired by models of human vision. For the FLIP metric, the lower the value, the better. FLIP is designed to have both low complexity and ease of use. It not only evaluates differences in colors and edges, but also pays attention to discrepancies in isolated pixels whose colors greatly differ from their surroundings. FLIP outputs a new image indicating the magnitude of the perceived difference between two images at every pixel. The algorithm can also pool the per-pixel differences into a weighted histogram, or generate a single value, which is the approach we use in our analysis. This value is zero when both images are identical, and it increases as the differences become more noticeable.
Computational Resources. The Super-Resolution Capsules (SRCaps) model was implemented using the PyTorch open-source platform with the PyTorch Lightning wrapper. The implementation made available by UchidaFootnote 1 was used as a foundation, together with metrics from the PyTorch Image Quality collection [23] and the official implementation of the FLIP metric [3]. For optimizer implementations, we used the torch-optimizer packageFootnote 2, and for comparative visualization of the metrics and generation of graphs we employed the Tensorboard [1] and Comet.mlFootnote 3 tools.
Parameter Search. To find the best set of parameters for our SRCaps model, we used the Ray [32] open-source framework (Tune), a scalable hyperparameter tuning library built on top of Ray CoreFootnote 4. We employed the Async Successive Halving (ASHA) scheduler during the search, as it decides at each iteration which trials are likely to perform badly and stops them, avoiding wasting resources on poor hyperparameter configurations. We train the models for a maximum of 100 epochs on the DIV2K training dataset and evaluate them using the MS-SSIM performance on the first 10 images of the DIV2K validation dataset. We selected these values to allow fast experimentation with different sets of parameters, as we observed during numerous training runs that the model's performance tends to start stabilizing at around 100 epochs.
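The idea behind successive halving can be illustrated in a few lines of plain Python. This is a synchronous simplification of ASHA (which promotes trials asynchronously), and `evaluate` is a hypothetical scoring callback, not part of Ray's API:

```python
import math

def successive_halving(configs, evaluate, min_epochs=1, eta=2, max_epochs=100):
    """Toy synchronous successive halving: train all configs for a small
    budget, keep the best 1/eta fraction, and repeat with a larger budget."""
    budget = min_epochs
    while len(configs) > 1 and budget <= max_epochs:
        scores = [(evaluate(cfg, budget), cfg) for cfg in configs]
        scores.sort(key=lambda t: t[0], reverse=True)
        keep = max(1, len(configs) // eta)
        configs = [cfg for _, cfg in scores[:keep]]  # prune weak trials
        budget *= eta                                # grow the budget
    return configs[0]

# toy example: the score grows with the budget, and config 0.9 is best
best = successive_halving(
    [0.1, 0.5, 0.9, 0.3],
    evaluate=lambda cfg, epochs: cfg * math.log(1 + epochs),
)
print(best)  # 0.9
```

The pruning is what saves resources: poor configurations are evaluated only at the cheapest budgets, while promising ones receive progressively more epochs.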
Experimental Setup. During the training of the different models used for comparison, each batch contains \(N = 16\) pairs of image crops from the dataset. For all models, during validation, the LR and HR images are used entirely, one by one (\(N = 1\)). All models discussed here were evaluated for \(4\times \) super-resolution over 2000 epochs, where each epoch involves one iteration through the training dataset, and each model is trained with two different loss functions: L1 and adaptive. Other functions were also evaluated, and their results are briefly discussed throughout this section.
The final SRCaps model has convolutional kernels with \(k = 3\) in its first layer, followed by 7 RDCBs (\(B = 7\)). The values used for the hyperparameters L and c are the same for all blocks: 3 and 4, respectively. In the last convolutional layer, we chose \(k = 3\), as well as for the convolutional layers internal to the UPNet. Throughout the neural network, we used \(act = ReLU\), \(k = 3\), \(F = 128\), \(st = 1\) and \(p = \text {'same'}\). Setting the padding to 'same' mode means choosing its value so that the input dimensions (H and W) are preserved in the output, that is, \(p = \frac{k-1}{2}\). Despite having fewer layers and only seven residual blocks in its composition, the SRCaps network has a considerable number of parameters: 15M. This represents 13.5M, 2.4M, and 10.2M more parameters than EDSR, RCAN, and WDSR, respectively. This is due to the vectorial nature of the capsule, which adds an extra dimension to its composition.
HR image patches (\(patch\_size\)) of size 128\(\times \)128 and their corresponding 32\(\times \)32 LR patches were used during training. The weights of the networks are updated by the Adam optimizer [27], with \(\beta _{1} = 0.9\), \(\beta _{2} = 0.999\), and \(\epsilon = 10^{-8}\); the networks are trained with an initial learning rate of \(lr = 10^{-4}\), which decays to half of its current value every 500 epochs.
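The resulting step-decay schedule can be written compactly; in PyTorch it corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.5)`:

```python
def learning_rate(epoch, lr0=1e-4, step=500):
    """Step decay used in training: halve the initial rate every
    `step` epochs (values taken from the setup described above)."""
    return lr0 * 0.5 ** (epoch // step)

print(learning_rate(0))     # 0.0001
print(learning_rate(500))   # 5e-05
print(learning_rate(1999))  # 1.25e-05
```

Over the 2000 training epochs, the rate is therefore halved three times, ending at one eighth of its initial value.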
When used, the adaptive loss function was initialized with the default values from the official implementation [5]. Employing these values is equivalent to starting the training with the Charbonnier/Pseudo-Huber function and letting the network learn from it what values for its parameters and, consequently, which function of the subset of the general function is more appropriate.
Baselines and Results. The EDSR model used is the base model defined by Lim et al. [33]. We chose the simplest version of the model because it has a smaller number of parameters than the SRCaps model. It is composed of 16 residual blocks without residual scale application since only 64 feature maps (filters) are used per convolutional layer. All convolutions, including the ones internal to the upscale network, have a kernel size of 3\(\times \)3. During its execution, the input images are subtracted from the mean RGB values of the training images of the DIV2K dataset, which are 0.4488, 0.4371 and 0.4040. These values range from 0 to 1 and are multiplied by the maximum pixel value, 255. The RDN [64] model, as well as the EDSR, generates 64 filters as the output of its convolutional layers, and uses \(k = 3\) for all convolutional kernels, except for those used in the fusion layers of LFF and GFF, which have \(k = 1\). In this model, 16 RDB blocks were used, with 8 convolutional layers each.
For the WDSR model, we used the original implementation by Yu et al. [61]. The chosen version was wdsr-b, with 16 large residual blocks that generate 128 filters, but whose internal convolutional layers generate \(6 \times \) more filters. Like EDSR, this model also subtracts the mean RGB values of the DIV2K images. The RCAN model we used is also the original implementation [63], available in the same repository used as a baseline. It is composed of 10 residual groups (RG) that form the Residual in Residual (RiR) structure, in which each RG contains 16 Residual Channel Attention Blocks (RCAB). It has kernels of size \(k = 3\) generating \(C = 64\) filters in all convolutional layers, except for those in the channel reduction and amplification mechanism, which have \(k = 1\), and generate \(\frac{C}{r}=4\) and \(C = 64\) filters respectively, with reduction factor \(r = 16\).
Table 1 summarizes the results obtained for all models and datasets after the learning phase. From this table, we highlight the following points. First, the SRCaps model obtained results comparable to the EDSR model, sometimes surpassing it on some metrics, particularly on the B100 dataset. Second, as shown in Figs. 4 and 5, the proposed model manages to recreate the connection between the different characters more precisely, while models with better metrics, such as RCAN and RDN, tend to thin the connection, as they do for the leftmost part of the symbol on top. Finally, despite reconstructing rounded edges with quality, a deficiency of the SRCaps model lies in the reconstruction of straight, usually diagonal, edges.
The results obtained with the RCAN model are remarkable, reaching the highest value in all metrics. We highlight that our goal is not to push the state-of-the-art but to bring insights into capsules applied to super-resolution problems. We believe that future research on capsule networks could benefit from our findings.
5 Conclusions
The purpose of this work was to evaluate the use of the capsule concept for solving single image super-resolution problems, as well as to verify new forms of training and validating neural networks for this purpose. We showed that, despite the inferior overall result, a trained network with a smaller number of layers obtained a relevant result, indicating that networks that use capsules can have applications in super-resolution. We hypothesize that the non-linearity function applied together with the capsules may be a limiting factor, given the different nature of the problem relative to its initial usage (super-resolution \(\times \) classification).
Throughout our work, we investigated the contribution of many hyperparameters, such as the activation function, learning rate, optimizer, and loss function. Regarding the latter, we evaluated loss functions that take the human visual system into account, as suggested by previous works [55, 57]. Additionally, we studied different architectural designs to compose our capsule network. The fact that the adaptive function [5] is a superset of several others, and that the network can learn, along with the other weights, the optimal values for its two main parameters (\(\alpha \) and c), allows the network to discover which loss function best fits the problem. Thus, it is possible to train the network starting from a function similar to L1, while modifying it at each iteration to extract as much useful information as possible from the data. The current limitations of the most used metrics in the literature [31] were also emphasized in this work, showing that visual evaluation of the results is still essential. Alternative metrics such as MS-SSIM [55] and FLIP [3] have been suggested, encouraging discussion of new metrics.
Several points of possible improvement that could not be deeply evaluated were identified. As a future line of research, we intend to replace the composition of the UPNet network, which is used in much the same way by several state-of-the-art networks. One could revisit the concept of reverse convolutions, or deconvolutions, as used by Dong et al. [12], and also the deconvolutional capsules created by LaLonde and Bagci [28], or adopt more recent methods from the literature. Kim and Lee [26] recently proposed the enhanced upscaling module (EUM), which achieves better results through nonlinearities and residual connections.
References
Abadi, M., et al.: TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems (2015). Software available from tensorflow.org
Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: dataset and study. In: CVPR Workshops, pp. 1–8 (2017)
Andersson, P., Nilsson, J., Akenine-Möller, T., Oskarsson, M., Åström, K., Fairchild, M.D.: FLIP: a difference evaluator for alternating images. Proc. ACM Comput. Graph. Interact. Tech. (2020)
Anwar, S., Khan, S.H., Barnes, N.: A deep journey into super-resolution: a survey. ACM Comput. Surv. 53(3), 60:1–60:34 (2020). https://doi.org/10.1145/3390462
Barron, J.T.: A More General Robust Loss Function. arXiv preprint arXiv:1701.03077 (2017)
Behjati, P., Rodriguez, P., Mehri, A., Hupont, I., Tena, C.F., Gonzalez, J.: OverNet: lightweight multi-scale super-resolution with overscaling network. In: WACV, pp. 1–11 (2021)
Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In: BMVC (2012)
Cai, J., Gu, S., Timofte, R., Zhang, L.: NTIRE 2019 challenge on real image super-resolution: methods and results. In: CVPR Workshops, pp. 1–8 (2019)
Canny, J.: A computational approach to edge detection. Trans. Pattern Anal. Mach. Intell. 8(6), 679–698 (1986)
Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016)
Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 184–199. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10593-2_13
Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 391–407. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_25
Gu, J., Tresp, V.: Improving the robustness of capsule networks to image affine transformations. In: CVPR, pp. 1–15 (2020)
Gu, J., Wu, B., Tresp, V.: Effective and efficient vote attack on capsule networks. In: ICLR (2021)
Hinton, G., Sabour, S., Frosst, N.: Matrix capsules with EM routing. In: ICLR, pp. 1–10 (2018)
Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. In: Artificial Neural Networks and Machine Learning, pp. 44–51 (2011)
Hore, A., Ziou, D.: Image quality metrics: PSNR vs. SSIM. In: ICPR, pp. 2366–2369 (2010)
Hsu, J., Kuo, C., Chen, D.: Image super-resolution using capsule neural networks. IEEE Access (2020)
Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: CVPR, pp. 1–9 (2015)
Huang, W., Zhou, F.: DA-CapsNet: dual attention mechanism capsule network. Sci. Rep. (2020)
Irani, M., Peleg, S.: Improving resolution by image registration. CVGIP: Graph. Model. Image Process. 53(3), 231–239 (1991)
Ji, X., Cao, Y., Tai, Y., Wang, C., Li, J., Huang, F.: Real-world super-resolution via Kernel estimation and noise injection. In: CVPR Workshops, pp. 1–8 (2020)
Kastryulin, S., Zakirov, D., Prokopenko, D.: PyTorch image quality: metrics and measure for image quality assessment (2019). https://github.com/photosynthesis-team/piq
Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: CVPR, pp. 1–8 (2016)
Kim, J., Kwon Lee, J., Mu Lee, K.: Deeply-recursive convolutional network for image super-resolution. In: CVPR, pp. 1–13 (2016)
Kim, J.H., Lee, J.S.: Deep residual network with enhanced upscaling module for super-resolution. In: CVPR Workshops, pp. 1–15 (2018)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) ICLR (2015)
LaLonde, R., Bagci, U.: Capsules for Object Segmentation. arXiv preprint arXiv:1804.04241 (2018)
Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: CVPR, pp. 1–8 (2017)
Li, C., Bovik, A.C.: Content-weighted video quality assessment using a three-component image model. J. Electron. Imag. 19, 19 (2010)
Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J.E., Stoica, I.: Tune: a research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118 (2018)
Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: CVPR Workshops, pp. 1–8 (2017)
Lin, M., Chen, Q., Yan, S.: Network in Network. arXiv preprint arXiv:1312.4400 (2013)
Majdabadi, M.M., Ko, S.B.: Capsule GAN for Robust Face Super-Resolution. Multim. Tools Appl. (2020)
Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: ICCV, vol. 2, pp. 416–423 (2001)
Nasrollahi, K., Moeslund, T.B.: Super-resolution: a comprehensive survey. Mach. Vis. Appl. 25(6), 1423–1468 (2014)
Odena, A., Dumoulin, V., Olah, C.: Deconvolution and checkerboard artifacts. Distill (2016). http://distill.pub/2016/deconv-checkerboard
Pandey, R.K., Saha, N., Karmakar, S., Ramakrishnan, A.G.: MSCE: an edge preserving robust loss function for improving super-resolution algorithms. arXiv preprint arXiv:1809.00961 (2018)
Ren, H., Kheradmand, A., El-Khamy, M., Wang, S., Bai, D., Lee, J.: Real-world super-resolution using generative adversarial networks. In: CVPR Workshops, pp. 1–8 (2020)
Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. In: NeurIPS, pp. 3856–3866 (2017)
Sabour, S., Tagliasacchi, A., Yazdani, S., Hinton, G.E., Fleet, D.J.: Unsupervised part representation by flow capsules. In: Meila, M., Zhang, T. (eds.) ICML, vol. 139, pp. 9213–9223 (2021)
Salimans, T., Kingma, D.P.: Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In: NeurIPS, pp. 901–909. Curran Associates, Inc. (2016)
Shi, W., et al.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: CVPR, pp. 1–8 (2016)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
Singh, M., Nagpal, S., Singh, R., Vatsa, M.: Dual directed capsule network for very low resolution image recognition. In: ICCV, pp. 1–8 (2019)
Sobel, I., Feldman, G.: A \(3\times 3\) Isotropic Gradient Operator for Image Processing (1968). Talk at the Stanford Artificial Project
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: AAAI, pp. 4278–4284 (2017)
Szegedy, C., et al.: Going deeper with convolutions. In: CVPR, pp. 1–8 (2015)
Timofte, R., et al.: NTIRE 2017 challenge on single image super-resolution: methods and results. In: CVPR Workshops, pp. 1110–1121 (2017)
Timofte, R., Gu, S., Wu, J., Van Gool, L.: NTIRE 2018 challenge on single image super-resolution: methods and results. In: CVPR Workshops, pp. 1–17 (2018)
Veit, A., Wilber, M.J., Belongie, S.: Residual networks behave like ensembles of relatively shallow networks. In: NeurIPS, pp. 550–558 (2016)
Wang, X., et al.: ESRGAN: enhanced super-resolution generative adversarial networks. In: ECCV, pp. 63–79 (2019)
Wang, Z., Liu, D., Yang, J., Han, W., Huang, T.: Deep networks for image super-resolution with sparse prior. In: ICCV, pp. 370–378 (2015)
Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: The Thirty-Seventh Asilomar Conference on Signals, Systems Computers, vol. 2, pp. 1398–1402 (2003)
Wang, Z., Chen, J., Hoi, S.C.H.: Deep learning for image super-resolution: a survey. Trans. Pattern Anal. Mach. Intell. 43(10), 3365–3387 (2021)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. Trans. Image Process. 13(4), 600–612 (2004)
Xu, J., Zhao, Y., Dong, Y., Bai, H.: Fast and accurate image super-resolution using a combined loss. In: CVPR Workshops, pp. 1093–1099 (2017)
Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. Trans. Image Process. 19(11), 2861–2873 (2010)
Yu, C., Zhu, X., Zhang, X., Wang, Z., Zhang, Z., Lei, Z.: HP-capsule: unsupervised face part discovery by hierarchical parsing capsule network. In: CVPR, pp. 4022–4031 (2022)
Yu, J., Fan, Y., Yang, J., Xu, N., Wang, X., Huang, T.S.: Wide Activation for Efficient and Accurate Image Super-Resolution. arXiv preprint arXiv:1808.08718 (2018)
Zhang, K., Gu, S., Timofte, R.: NTIRE 2020 challenge on perceptual extreme super-resolution: methods and results. In: CVPR Workshops, pp. 1–10 (2020)
Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: ECCV, pp. 1–8 (2018)
Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image super-resolution. In: CVPR, pp. 2472–2481 (2018)
Zhao, H., Gallo, O., Frosio, I., Kautz, J.: Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 3(1), 47–57 (2017)
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
de Araújo, G.C., Jordão, A., Pedrini, H. (2023). Single Image Super-Resolution Based on Capsule Neural Networks. In: Naldi, M.C., Bianchi, R.A.C. (eds.) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science, vol. 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_8
Print ISBN: 978-3-031-45391-5. Online ISBN: 978-3-031-45392-2