1 Introduction

Facial de-occlusion and inpainting are special instances of image inpainting, used in the restoration of damaged images [26], removal of unwanted content [26] and data augmentation [8]. They are also important preprocessing steps in many computer vision tasks, because occlusions break the structure of the face and hide the identity of the subject, degrading performance in downstream applications [37]. Occlusions degrade the performance of face parsing [30, 32], object and face detection [21] and facial expression analysis [32]. Furthermore, occlusions hide landmarks used in face alignment and frontalization [2, 41].

Face inpainting also poses several challenges that are difficult to overcome. First, the human face carries biometric information unique to each subject, revealing identity, age, sex, emotions, ethnicity, and even culture and religion. This biometric information must be preserved in the restored image. Second, there are many plausible solutions for filling the missing holes in an image, of which the ground-truth is just one. For example, given a face covered with a surgical mask, the mouth may be smiling in the ground-truth image whereas the reconstruction shows it closed. Third, the set of possible solutions is restricted by the overall content, as the restoration must preserve the subject’s skin and hair texture, facial symmetry, structure and expression, along with variations of illumination and pose. Fourth, occlusions can appear anywhere in the image and may be of any shape and size. Large occlusions covering both sides of the face are more difficult to restore than small ones covering just one side. Fifth, unique facial marks such as makeup, tattoos, scars, stains, wrinkles, and accessories are difficult to recover without a reference image. Finally, the restored area must be visually consistent with the neighboring region, creating an imperceptible transition between them [1, 38, 44].

Researchers have developed new methods to overcome these challenges and improve image quality. The most prominent are GAN-based networks, which are able to reconstruct images with photo-realism. Modifications to the GAN architecture, with the inclusion of new building blocks, network elements and loss functions, address specific facial inpainting issues. This review summarizes these developments, building a solid foundation for future research. The rest of the article is organized as follows. Section 2 describes the network architecture and components and presents methods for training stability. Section 3 discusses the current limitations found in the literature. Finally, we conclude in Sect. 4.

2 Theoretical Background

The network used for image inpainting consists of a number of components and elements that contribute to the final result. This section discusses the main network structures found in the literature.

2.1 Network Architecture

Generative Adversarial Networks (GAN). Generative Adversarial Networks (GAN) consist of a generator and a discriminator network [12]. The generator creates images from simple random noise, usually following a uniform or spherical Gaussian distribution [13]. The discriminator is a classifier that distinguishes between real and fake images. Both networks play an adversarial game in which the generator tries to fool the discriminator by gradually improving the image quality. They are trained in alternation until the discriminator is unable to distinguish synthetic from real images [12]. Figure 1 shows the GAN architecture.

Fig. 1.

Original GAN architecture proposed by Goodfellow et al. [12]. The generator receives a random noise vector as input and creates fake images. The discriminator is a classifier that evaluates whether the image is real or fake.

In image inpainting and de-occlusion, the generator input is a set of occluded images instead of random noise. After training, only the generator is used to infer new images and the discriminator is removed. Figure 2 illustrates the basic GAN architecture used in image de-occlusion and inpainting. Variations of this architecture found in the literature are described in the next sections.
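To make the adversarial game concrete, the sketch below computes the standard non-saturating GAN losses from the discriminator's probability outputs. This is a minimal NumPy illustration under our own naming: the arrays stand in for discriminator outputs, not for an actual network from the cited works.

```python
import numpy as np

def bce(p, target):
    """Binary cross-entropy between predicted probabilities and targets."""
    eps = 1e-12  # avoid log(0)
    return float(-np.mean(target * np.log(p + eps)
                          + (1 - target) * np.log(1 - p + eps)))

def discriminator_loss(p_real, p_fake):
    # The discriminator wants real images scored 1 and fake images scored 0.
    return bce(p_real, np.ones_like(p_real)) + bce(p_fake, np.zeros_like(p_fake))

def generator_loss(p_fake):
    # Non-saturating form: the generator wants D(fake) pushed towards 1.
    return bce(p_fake, np.ones_like(p_fake))
```

As the discriminator assigns higher probability to real images and lower to fakes, its loss shrinks; as the generator fools the discriminator, its own loss shrinks.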

Fig. 2.

GAN architecture used in image de-occlusion and inpainting. Instead of random noise, the generator receives occluded images as input and creates occlusion-free images.

Two-Stage Network. Splitting the inpainting process into two or more stages improves the image quality. In this setting, each stage is responsible for a portion of the restoration process. The most common approaches are coarse-to-fine and prior information.

In the coarse-to-fine approach, the first stage creates an initial coarse prediction of the de-occluded image and the second stage takes the result of the first stage as input and refines the prediction [45]. This method gained popularity for its higher performance compared to single-stage networks [4, 9, 14, 44, 45]. Figure 3 illustrates a two-stage network with the coarse-to-fine approach.

Prior information such as landmarks, edges, or semantic segmentation maps provides spatial and structural information, guiding the inpainting process. This allows the inpainting network to build the face with realistic structure and facial expressions. In general, the prior network is trained to detect landmarks, edges or semantic segmentation and create the respective maps, which the inpainting network uses to guide the completion process. The landmark map improves the perceptual quality of the image, providing spatial consistency in unaligned faces [38]. The effect of landmarks in image inpainting is so strong that swapping the maps of two persons changes their identities and facial expressions [43]. The edge generator network predicts the edge map of the occlusion-free image, which is later used to guide the inpainting process [36, 42, 50]. The generator receives the masked grayscale ground-truth image, the masked edge map and the binary mask indicating the occluded area. Likewise, a parsing network creates an occlusion-free semantic segmentation map of the original occluded image, which guides the de-occlusion process [35, 46]. Alternatively, the parsing network can provide semantic regularization, where the semantic segmentation map of the generated image is compared with the ground-truth [11, 27].
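As an illustration of how prior maps enter the network, the NumPy sketch below assembles a generator input from a masked image, the binary occlusion mask and an edge map. The 5-channel layout and function name are our simplification; the exact inputs and channel ordering vary across the cited works.

```python
import numpy as np

def make_generator_input(image, mask, edge_map):
    """Stack the masked image, the binary mask and a prior (edge) map
    along the channel axis, as many prior-guided inpainting networks do.

    image:    (H, W, 3) float array in [0, 1]
    mask:     (H, W) binary array, 1 = occluded pixel
    edge_map: (H, W) float array in [0, 1]
    Returns a (H, W, 5) array.
    """
    masked = image * (1.0 - mask[..., None])  # zero out occluded pixels
    return np.concatenate([masked, mask[..., None], edge_map[..., None]], axis=-1)
```

The mask channel tells the network which pixels are invalid, and the edge channel constrains the structure of the completion.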

Fig. 3.

The two-stage architecture consists of a coarse and a refinement network.

2.2 Generator

In GANs, the generator is any neural network able to model the probability distribution of the real data [12, 13]. Sampling from this distribution then generates completely new images. The input to the generator can be a vector of random noise, the incomplete image, a semantic segmentation map, an edge map, landmarks or a binary mask. In face image de-occlusion and inpainting, the generator can be an encoder-decoder, a U-Net, a multi-branch network or any variation thereof.

Encoder-Decoder and U-Net. An encoder-decoder is a generative model trained to reconstruct its input data in an unsupervised way [34]. The network has a symmetric architecture composed of an encoder and a decoder. The encoder consists of a stack of down-sampling layers that compress the original data into a low-dimensional representation. The decoder contains a series of up-sampling layers that recover the original information. Optionally, a bottleneck layer can be inserted between the encoder and the decoder. This layer converts the encoder’s last layer into a vector with similar functionality to the random noise vector in the original GAN.

The encoder-decoder architecture with the bottleneck layer is appropriate for image inpainting with GANs. The encoder converts the occluded image into a vector, and the decoder reconstructs the de-occluded face from this vector.

The U-Net has a similar architecture to the encoder-decoder; the main difference is the skip connections concatenating each encoder layer with the corresponding symmetric decoder layer. In the original architecture [40], the U-Net encoder is a series of 3\(\times \)3 convolutions followed by ReLU and 2\(\times \)2 max pooling, while the decoder is a series of up-sampling layers with a 2\(\times \)2 kernel, a concatenation with the corresponding encoder layer and 3\(\times \)3 convolutions with ReLU. Figure 4 illustrates the encoder-decoder and U-Net architectures.
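The down-sampling, up-sampling and skip-concatenation operations can be sketched in NumPy as follows. The 2\(\times \)2 max pooling and nearest-neighbor up-sampling stand in for the learned convolutional layers of a real U-Net; only the shape arithmetic is faithful.

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling on a (H, W, C) feature map (H, W even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample2x2(x):
    """Nearest-neighbor 2x up-sampling, a stand-in for learned up-sampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_skip(encoder_feat, decoder_feat):
    """U-Net skip connection: concatenate encoder and decoder feature
    maps of matching spatial size along the channel axis."""
    return np.concatenate([encoder_feat, decoder_feat], axis=-1)
```

The concatenation doubles the channel count at each decoder level, which is why U-Net decoder convolutions take twice as many input channels as their encoder counterparts.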

Fig. 4.

The encoder-decoder and U-Net architectures consist of an encoder with down-sampling layers, a bottleneck layer in the middle and a decoder with up-sampling layers. The U-Net has skip-connections concatenating each encoder layer with the corresponding decoder layer. Left: Encoder-decoder. Right: U-Net.

Modified versions of both encoder-decoder and U-Net are commonly used in the generator of GANs used in face de-occlusion and inpainting. The variations include adding dilated convolution [21, 25, 26, 28, 29, 35, 46, 47], SE block [21, 25, 35, 47], HDC [10] and self attention blocks [33].

2.3 Discriminator

The discriminator is a classifier that estimates the probability that an image is real rather than synthesized [12]. However, since the inpainted region is only a fraction of the entire image, the discriminator is biased towards classifying a generated image as real, resulting in poor inpainting quality. This section describes variations of the discriminator that address this issue.

Local and Global Discriminators. The combination of global and local discriminators improves the reconstruction realism and consistency. The global discriminator evaluates the entire image, while the local discriminator judges a small patch around the reconstructed area. The objective function is the sum of the loss functions applied to each discriminator [27]. A less common variation combines the outputs of both discriminators into a single number representing the probability that the image is real or reconstructed. Specifically, the outputs of both discriminators are concatenated and then passed through a fully-connected layer, and the loss is calculated at the combined output [19]. Figure 5 shows an example architecture with local and global discriminators and combined outputs. It is a stack of \(5\times 5\) convolutions with stride 2 followed by a fully-connected layer that outputs a 1024-dimensional vector; the concatenated output of both discriminators passes through a fully-connected layer with sigmoid activation [19].

Fig. 5.

Network architecture consisting of one generator and two discriminators. The generator takes the occluded image as input and outputs the occlusion-free image. The two discriminators learn to classify the synthesized content as real or fake: the global discriminator evaluates the entire image, while the local discriminator focuses on a small area around the damaged region [19].

PatchGAN. Instead of evaluating the entire image as real or fake like the standard discriminator, PatchGAN classifies each patch in the input image. This discriminator runs across the image like a convolution and outputs the average of all patch scores. PatchGAN models high-frequency details, providing texture and style losses [20]. Figure 6 shows the structure of PatchGAN.
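A toy NumPy sketch of this patch-wise evaluation follows: each patch receives an independent score and the final decision averages them. The `score_fn` stand-in replaces the small convolutional classifier of an actual PatchGAN, and the non-overlapping tiling is a simplification of its sliding (convolutional) receptive fields.

```python
import numpy as np

def patchgan_scores(image, patch=16, score_fn=None):
    """Score every non-overlapping patch of a (H, W) image independently
    and return the per-patch score map plus the averaged decision.
    score_fn is a stand-in for the convolutional patch classifier."""
    if score_fn is None:
        # Toy score: sigmoid of the patch mean, yielding a value in (0, 1).
        score_fn = lambda p: 1.0 / (1.0 + np.exp(-p.mean()))
    h, w = image.shape[:2]
    rows, cols = h // patch, w // patch
    scores = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            scores[i, j] = score_fn(image[i*patch:(i+1)*patch,
                                          j*patch:(j+1)*patch])
    return scores, float(scores.mean())
```

Because each score depends only on one patch, the gradient signal penalizes local texture errors independently, which is what gives PatchGAN its sensitivity to high-frequency detail.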

Fig. 6.

PatchGAN classifies each patch in the input image as real or fake. This discriminator runs across the image like a convolution and outputs the average of all patches.

SN-PatchGAN. SN-PatchGAN is a fully convolutional spectral-normalized Markovian discriminator. This discriminator computes the loss directly on each point of the last feature map. SN-PatchGAN was designed to inpaint images with regular and irregular shapes of any size and in multiple regions of the image. It provides faster and more stable training, replaces the global and local discriminators and dispenses with the perceptual loss [44]. The original discriminator consists of a stack of \(5\times 5\) convolution layers with stride 2 and spectral normalization. SN-PatchGAN can be interpreted as a 3D classifier, where the loss is applied to each element of the last layer’s feature map, as illustrated in Fig. 7.
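The per-point loss can be sketched as the hinge objective applied element-wise to the discriminator's last feature map, following the formulation in [44]. The feature maps here are plain NumPy arrays standing in for network outputs.

```python
import numpy as np

def sn_patchgan_d_loss(feat_real, feat_fake):
    """Hinge loss over every element of the (h, w, c) last feature map,
    treating the discriminator as a 3D classifier."""
    return float(np.mean(np.maximum(0.0, 1.0 - feat_real)) +
                 np.mean(np.maximum(0.0, 1.0 + feat_fake)))

def sn_patchgan_g_loss(feat_fake):
    """Generator hinge loss: raise the discriminator's scores on fakes."""
    return float(-np.mean(feat_fake))
```

When the discriminator already scores real points above +1 and fake points below −1, its hinge loss vanishes, so training focuses on the points it still gets wrong.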

Fig. 7.

Fully convolutional spectral-normalized Markovian discriminator (SN-PatchGAN). The discriminator loss is applied in the last feature map, resulting in a 3D classifier [44].

2.4 Building Blocks

In the context of this paper, a block is a group of layers working together that executes a specific task. A block can be inserted between two layers in the generator. This section describes the main building blocks used in GANs, such as self-attention, residual blocks and squeeze and excitation.

Self-Attention. Convolutions process local information limited to the kernel shape and size. When the kernel falls entirely inside a hole larger than the kernel, it captures only invalid pixels and cannot hallucinate meaningful content, so convolutions alone are not suitable for inpainting regions larger than the kernel size. Self-attention, in contrast, is a non-local mechanism that creates relationships between distant regions in the image [48]. Figure 8 shows the self-attention module.

Fig. 8.

The self-attention module creates relationships between distant regions in the image. \(\otimes \) denotes matrix multiplication.

The self-attention output is computed as follows. Let us define C as the number of channels, N as the number of feature locations from the previous layer, \(x \in \mathbb {R}^{C\times N}\) as the previous layer feature map, \(\textbf{W}_f\), \(\textbf{W}_g\), \(\textbf{W}_h \in \mathbb {R}^{\bar{C}\times C}\) and \(\textbf{W}_v \in \mathbb {R}^{C\times \bar{C}}\) as the weight matrices, and \(\bar{C} = C/8\). The feature maps \(\textbf{f}\) and \(\textbf{g}\) are calculated as \(f(x)=\textbf{W}_f x\), \(g(x)=\textbf{W}_g x\).

$$\begin{aligned} \beta _{j,i} &= \frac{\exp (s_{ij})}{\sum _{i=1}^N \exp (s_{ij})}, & s_{ij} = f(x_i)^\top g(x_j) \end{aligned}$$
(1)

where the softmax output \(\beta _{j,i}\) indicates the extent to which the model attends to the \(i^{th}\) location when synthesizing the \(j^{th}\) region [48]. The output of the attention layer \(\textbf{O}=(o_1,...,o_j,...,o_N)\in \mathbb {R}^{C\times N}\) is given by:

$$\begin{aligned} & o_j = v\left( \sum _{i=1}^N\beta _{j,i}h(x_i)\right) , & h(x_i) & = \textbf{W}_h x_i, & v(x_i) & = \textbf{W}_v x_i \end{aligned}$$
(2)

The final output is given by \(y_i=\gamma o_i + x_i\), where \(\gamma \) is a learned parameter.
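Equations (1) and (2) can be checked numerically. The NumPy sketch below follows the definitions above; the dimensions are arbitrary, and \(\gamma \) is set to zero (its common initial value), so the layer starts as the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N = 16, 64                 # channels, feature locations (illustrative)
C_bar = C // 8
x = rng.standard_normal((C, N))            # previous layer feature map
Wf = rng.standard_normal((C_bar, C)) * 0.1  # W_f
Wg = rng.standard_normal((C_bar, C)) * 0.1  # W_g
Wh = rng.standard_normal((C_bar, C)) * 0.1  # W_h
Wv = rng.standard_normal((C, C_bar)) * 0.1  # W_v
gamma = 0.0                   # learned scale, commonly initialized to zero

f, g, h = Wf @ x, Wg @ x, Wh @ x           # each (C_bar, N)
s = f.T @ g                                # s_ij = f(x_i)^T g(x_j), (N, N)
beta = np.exp(s - s.max(axis=0, keepdims=True))
beta /= beta.sum(axis=0, keepdims=True)    # softmax over i for each column j
o = Wv @ (h @ beta)                        # o_j = v(sum_i beta_{j,i} h(x_i))
y = gamma * o + x                          # final output y_i
```

With \(\gamma =0\) the attention branch contributes nothing, letting the network learn to rely on the non-local evidence gradually during training.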

Residual Block. A residual block (ResBlock) consists of a series of convolutional layers with skip connection, i.e., the input adds to the output as illustrated in Fig. 9.

Fig. 9.

A residual block consists of two or more convolution layers with skip connection where the input adds to the output. \(\phi \) is the activation function and \(\bigoplus \) is element-wise sum.

The residual block, originally conceived for image classification [15], avoids gradient dispersion in very deep networks [42]; variants replace the standard convolution with dilated [46] or multi-dilated convolution [26]. Moreover, residual networks are easy to optimize [15], and train faster while achieving similar losses compared to non-residual networks [24].

Residual blocks are used in the bottleneck layer of encoder-decoders [3, 43, 46], in the contraction and expansion sides of the U-Net [7, 10, 26], or as building blocks of multi-branch networks [31, 32].
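A minimal NumPy sketch of the residual computation \(y = \phi (F(x) + x)\), with two linear maps standing in for the convolution layers of Fig. 9:

```python
import numpy as np

def relu(x):
    """Activation function phi."""
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), where F is two linear layers (stand-ins for
    the convolutions of a real residual block)."""
    out = relu(W1 @ x)
    out = W2 @ out
    return relu(out + x)   # skip connection: input added to output
```

If the weights drive \(F(x)\) towards zero, the block degrades gracefully to (an activated) identity, which is the property that makes very deep residual networks trainable.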

Squeeze and Excitation Blocks. The Squeeze-and-Excitation (SE) block models the relationships between channels in the feature maps [18]. The block performs channel-wise feature re-calibration, strengthening meaningful features and weakening worthless ones. SE blocks fit between two layers, achieving higher performance gain at a small computational cost. The squeeze operation uses global average pooling to aggregate each feature map across its spatial dimension, and the excitation operation is a simple gating that produces a collection of weights that are applied to the feature maps. Figure 10 illustrates the architecture of the SE block.

Fig. 10.

The squeeze and excitation block scales the feature maps in a given layer according to their importance. The reduction ratio q is a hyperparameter that balances performance and computational complexity. The default is \(q=16\).
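The squeeze and excitation operations reduce to a few lines of NumPy; here `W1` and `W2` play the role of the two fully-connected layers (C → C/q → C), with the reduction ratio q implied by their shapes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, W1, W2):
    """Squeeze-and-Excitation on a (H, W, C) feature map.
    Squeeze: global average pool per channel.
    Excitation: FC (C -> C/q) -> ReLU -> FC (C/q -> C) -> sigmoid.
    The resulting per-channel weights rescale the feature maps."""
    z = x.mean(axis=(0, 1))                  # squeeze: (C,)
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0))  # excitation weights: (C,)
    return x * s                             # channel-wise re-calibration
```

Since the gating weights lie in (0, 1), the block can only attenuate channels, never amplify them; informative channels keep weights near 1 while worthless ones are suppressed.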

2.5 Training Stability

This section presents two approaches to stabilize the training of GANs. Zhang et al. proposed the use of spectral normalization on both the generator and discriminator, as well as employing the two time scale update rule (TTUR) [48].

Two Time Scale Update Rule (TTUR). Using different learning rates for the generator and discriminator in combination with Adam stochastic optimization improves convergence and stability. In the two time scale update rule (TTUR), the learning rate of the generator is generally lower than that of the discriminator. Although the TTUR theory ensures convergence, the appropriate learning rates must be found empirically for each network [16]. The generator learning rate found in the literature is 1e-4, while discriminator learning rates are 1e-12 [22, 23], 1e-4 [11] and 4e-4 [5, 21, 48].

Spectral Normalization. Spectral normalization is a weight-normalization technique originally proposed to stabilize the training of the discriminator [39]. Spectral normalization is simple to implement, has low computation cost, and further improves stability when applied in combination with gradient penalty.

Furthermore, when employed in both the generator and discriminator, spectral normalization further reduces the discriminator-to-generator update ratio, decreases the computational cost, and provides more stable training [48]. Spectral normalization is given by Eq. 3:

$$\begin{aligned} & \textbf{W}_{SN} = \frac{\textbf{W}}{\eta (\textbf{W})}, & \eta (\textbf{W}) = \max _{\left\| \textbf{h} \right\| _2 \le 1} \left\| \textbf{W}\textbf{h}\right\| _2 \end{aligned}$$
(3)

where \(\eta (\textbf{W})\) is the spectral norm of the matrix \(\textbf{W}\).
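Equation 3 can be implemented with power iteration, as proposed in [39]. The sketch below estimates \(\eta (\textbf{W})\) for a plain NumPy matrix; the fixed iteration count and seeding are illustrative choices (practical implementations run a single iteration per training step, reusing the vectors).

```python
import numpy as np

def spectral_normalize(W, n_iter=100):
    """Return (W / eta(W), eta(W)) where eta(W) is the spectral norm,
    estimated by power iteration on W^T W."""
    rng = np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    eta = float(u @ W @ v)   # estimate of the largest singular value
    return W / eta, eta
```

After normalization the weight matrix has spectral norm (approximately) one, bounding the Lipschitz constant of the layer, which is the property that stabilizes discriminator training.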

3 Limitations

Despite the impressive progress in image de-occlusion and inpainting in recent years, several challenges remain. Given the broad extent of current limitations, it is unlikely that a single solution will address all situations. This section analyzes key limitations identified during the review and proposes open areas for research.

3.1 Datasets

Model research still requires large amounts of training data, relying heavily on available datasets, and a model’s generalization capabilities on unseen data depend largely on the dataset it was trained on. An open research area is the development of models, algorithms, and methods resilient to data availability, i.e., models that generalize well from little training data.

Moreover, despite the variety of available data, there are few datasets created specifically for face de-occlusion and inpainting, and these contain few images compared with other face databases. For this reason, researchers build their own synthetic images from public face datasets, usually by overlaying an object or a binary mask. This approach may be adequate for model development, but not for inference in real-world scenarios, which require a large occluded-face dataset for testing.

3.2 Evaluation Metrics

User studies measure qualitative attributes that are hard to evaluate with quantitative methods alone. Since researchers employ different methodologies when conducting qualitative surveys, results cannot be compared across published studies. This situation could be avoided if researchers followed a formal protocol describing the survey process. The protocol might use psychophysical similarity measurements already established in the literature, such as the Two Alternative Forced Choice (2AFC) and Just Noticeable Differences (JND) used in [49].

Moreover, most quantitative evaluation metrics measure pixel-level statistics that are unable to capture human perception, yet for historical reasons they are still widely used for model comparison. The two most used metrics, PSNR and SSIM, carry a simple relationship between them [17]. Feature-level metrics, on the other hand, capture higher-level perceptual quality. LPIPS is the only feature-level metric found in the literature, but it still lags behind human-level perception. More research is needed on improved versions of LPIPS with higher perceptual fidelity, as well as on quantitative metrics able to evaluate other qualitative attributes such as effective occlusion removal, naturalness, image realism, and consistency.

3.3 Automatic De-occlusion

Most state-of-the-art models require a binary mask marking the occluded area. This can be useful for single-image restoration and de-occlusion of photographs, but it is unfeasible for videos, real-time de-occlusion and batch processing of many images. Existing methods for automatic detection and removal of occlusions fail to detect occlusions reliably, producing artifacts in the restored region [6]. Moreover, there are still few studies in this area.

3.4 Image Quality

Image quality remains an open problem; typical failures include eyes of different colors, distorted mouth and nose shapes, missing ears, texture discontinuities at border pixels, artifacts, blur, and poor background filling.

Models using prior information such as landmarks, edges and semantic segmentation maps, or another coarse-to-fine approach, rely on the quality of these prior predictions, which degrade on occluded faces, particularly in combination with large pose variations such as top or bottom views and profiles.

Large occlusions also degrade performance. The majority of models are trained with 25% of the region missing, and few restore more than 50%. It is especially challenging to remove occlusions covering symmetric parts of the face, for example both eyes, simply because the model does not know the color and shape of the eyes. In such cases, prior information helps with the structure of the face, but the texture remains missing.

3.5 Computational Cost

Current models have a high computational cost, restricting their use on edge devices and in real-time applications. Inference time remains too high for real-time use even with a GPU, and the number of parameters may preclude deployment on edge devices. Moreover, training a model takes days and model design takes months. More research is needed into better algorithms and more efficient methods to reduce these costs.

4 Conclusion

This paper reviewed GAN-based face image inpainting and de-occlusion studies found in the literature. More specifically, we explored the network architecture and its components, and analyzed current limitations.

The GAN architecture for image inpainting and two-stage networks were described. Encoder-decoder and U-Net are basic generator architectures that can be combined with other components for additional functionality. They are also used in single and multi-stage architectures. Local and global discriminators, PatchGAN and SN-PatchGAN improve the GAN’s ability to distinguish between real and fake images at local and global levels. Squeeze and excitation blocks perform channel-wise feature re-calibration, weighting the importance of each feature map in a given layer. TTUR, the Adam optimizer and spectral normalization accelerate and stabilize GAN training.

Finally, this study discussed the current limitations and challenges found in datasets, evaluation metrics, automatic de-occlusion, image quality and computational cost. Furthermore, we propose insights for future research.