1 Introduction

Computational methods of facial aging aim to generate an aged face while maintaining individual characteristics. Many factors influence aging: excess sun exposure, smoking, a polluted environment, stress, and genetics. In addition, surgical and non-surgical aesthetic procedures can mitigate the effects of time, as can the use of cosmetic products at the time of image capture. For these reasons, facial aging is a complex and non-deterministic process.

Recently, this theme has been the subject of many publications [1, 5, 8, 38], as it can support automated searches for missing people, the identification of criminals, and entertainment applications. Furthermore, it is also useful in biometric tasks (identifying individuals based on physical or behavioral characteristics), since it can reduce the gap between the individual characteristics present at training time and the current state of the faces, especially when the training was done with old images [14, 18].

Many recent publications have used generative models to perform facial aging, obtaining realistic results. Adversarial autoencoders were used first, followed by generative adversarial networks (GANs) and, more recently, diffusion models [36, 40].

This study evaluates four generative models on the facial aging task. In particular, we examine whether zero-shot diffusion models can generate aged images on par with GANs explicitly trained for face aging. To that end, we compare the outputs of two state-of-the-art GAN-based models, HRFAE and SAM, with those of two diffusion models, Pix2pix-zero and Instruct-pix2pix, using the following metrics: the mean absolute error (MAE) of the predicted age (measuring age regression), the Fréchet inception distance, FID (checking image realism), and the cosine similarity of the embeddings of a pre-trained face recognition network, FaceNet (gauging adherence to individual characteristics).

The article is organized as follows. Section 2 presents related work. Section 3 provides a brief review of GANs and diffusion models, a summary of the main aspects of each model used in this comparison, some methods used within the models, the FFHQ Aging image database, and the evaluation metrics. Section 4 details the methodology used to compare the techniques. Section 5 presents the conclusion.

2 Related Work

In the literature, many works study the aging effects on human faces using generative models, mostly employing generative adversarial networks, GANs, with a smaller number applying diffusion models to the task. However, we could not find a work that compares these two techniques on the same dataset.

The review [8] presents the main models that perform facial aging using deep networks, the evolution of the number of publications, a taxonomy of existing techniques, and the most important databases with faces and proper metadata. The authors also evaluate three recently published aging models for high-definition images: HRFAE [38], LIFE [23], and SAM [1]. In the comparison, they use external photos of 4 young people, between 24 and 32 years old, and show how the models behave when aging the photos for three age groups: 65, 50–69, and 60. For a quantitative analysis, the authors estimate the age of the original photo and of each generated photo using the pre-trained ArcFace model [4]. In addition, the FID metric is used to quantify the perceived realism of the images. Finally, they measure the cosine distance between the last-layer representations of the original image and the aged one in the pre-trained network.

Moreover, in [17], the authors review the literature and compare aging models based on GANs, among them CAAE [39], IPCGAN [35], and RCRIIT [9]. In the comparison, the authors use the same set of images from three aging databases: FG-NET, UTKFaces, and CACD.

Additionally, in HRFAE [38], the authors compared their model with IPCGAN [35] and S2GAN [37] using the FFHQ Aging database.

In SAM [1], the authors compare their model with LIFE [23] and HRFAE [38], which were considered state-of-the-art at the time of publication. The database used to evaluate the results is CelebA-HQ, as it contains celebrities' faces and ages. As in [8], the authors use ArcFace to calculate the cosine similarity of each pair of images. The study also presents a qualitative analysis of the images generated by each work: the authors generated 80 images and chose the one for which the ArcFace prediction was closest to the desired age. The SAM and HRFAE models can generate faces for a predefined age; nevertheless, the same protocol was used for comparison purposes. Finally, the authors used the ArcFace model during training and employed the Microsoft Azure Face API to assess identity maintenance. A survey was also carried out in which humans evaluated photos of the same individual and indicated which images they preferred, using the desired age and the quality of the generated image as criteria.

3 Material and Methods

3.1 Generative Adversarial Networks

Generative adversarial networks, GANs [7], are composed of two neural networks: the generator, which learns the distribution of the training data to produce examples close to real data, and the discriminator, a classification network that aims to separate generated data from real data. These two networks compete during training: the generator tries to produce examples close enough to the real ones to deceive the discriminator, which in turn tries to get better at detecting generated images. Over time, architectural improvements have been introduced to improve image quality and mitigate training problems.
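To make the adversarial game concrete, the sketch below performs one discriminator step and one generator step in PyTorch. The toy fully connected networks and random data are purely illustrative, not any architecture from the papers discussed.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
# Hypothetical toy generator and discriminator.
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

real = torch.randn(8, data_dim)   # stand-in for a batch of real images
z = torch.randn(8, latent_dim)
fake = G(z)

# Discriminator step: push real samples toward label 1, generated toward 0.
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make D classify generated samples as real.
g_loss = bce(D(fake), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The `detach()` in the discriminator step prevents its gradients from flowing into the generator, so each network is updated only on its own objective.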

HRFAE.

In High-Resolution Face Age Editing, HRFAE [38], the authors use an encoder-decoder architecture to perform age editing of high-resolution photos (1024\(\,\times \,\)1024). The generator G consists of an encoder E and a decoder D. The model receives the input image \(x_0\), which passes through the encoder, producing two copies of a latent vector \(E(x_0)\). The pre-trained DEX age estimator [29] determines the age \(\alpha _0\) of the input image; this age is encoded through a binary encoding module with a sigmoid activation function. The decoder D has two tasks: from the latent vector \(E(x_0)\), produce an image \(G(x_0, \alpha _0)\) as similar as possible to the input image, and also make the aged image \(G(x_0, \alpha _1)\) realistic and close to the desired age.

To achieve this goal, its cost function has three components, as shown in Eq. 1:

  1. Adversarial Loss (\(L_{GAN}\)): uses PatchGAN [15] with the objective function of LSGAN [21].

  2. Age Classification Loss (\(L_{class}\)): the pre-trained DEX model estimates the age, which is compared with the target via a categorical cross-entropy loss function.

  3. Reconstruction Loss (\(L_{recon}\)): monitors the model's ability to reconstruct the original image at the initial age, \(L_{recon} = \Vert G(x_0, \alpha _0)-x_0 \Vert _1\).

$$\begin{aligned} \mathcal {L} = \lambda _{recon} L_{recon} + \lambda _{class} L_{class} + L_{GAN}, \end{aligned}$$
(1)

where \(\lambda _{recon}\) and \(\lambda _{class}\) are hyper-parameters that weigh between maintaining the identity (reconstruction of the original image at the initial age) and the aging effect (classification at the correct age).
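As a minimal numeric illustration of how these weighted components combine, the snippet below computes the reconstruction term and the total of Eq. 1. The weight values and the classification/adversarial loss values are arbitrary placeholders, not the paper's settings.

```python
import numpy as np

# Weighted sum of the three HRFAE loss components (Eq. 1); lambda values
# and component losses here are made-up placeholders.
def hrfae_total_loss(l_recon, l_class, l_gan, lam_recon=10.0, lam_class=0.1):
    return lam_recon * l_recon + lam_class * l_class + l_gan

# Reconstruction term: L1 distance between the input and its
# reconstruction at the initial age.
x0 = np.full((3, 4, 4), 0.5)          # stand-in input image
x0_rec = x0 + 0.02                     # stand-in reconstruction G(x0, a0)
l_recon = np.abs(x0_rec - x0).mean()   # approximately 0.02

total = hrfae_total_loss(l_recon, l_class=0.5, l_gan=0.3)
```

Raising `lam_recon` favors identity preservation, while raising `lam_class` favors hitting the target age, mirroring the trade-off described above.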

In their training, the authors used the Flickr-Faces-HQ (FFHQ) high-definition image database [16]. Because it contains fewer images of older people, they used the StyleGAN network to generate 300,000 synthetic images, obtaining a database balanced across the age ranges present in it. The authors only used synthetic images in age groups with insufficient real images. Ultimately, they obtained 47,990 images with ages ranging from 20 to 69.

Metadata for these images, containing pose, age and gender predictions, and the segmentation of the face regions present in FFHQ, was published along with LIFE [23] under the name FFHQ Aging. This is the database used in the present work.

SAM.

In Style-Based Age Manipulation, SAM [1], the authors developed an architecture that enables facial aging or rejuvenation, taking a real image x and a desired age \(\alpha _t\) as input. To achieve this, they perform an image-to-image translation. The first step is to find the latent vectors in the \(w^*\) space of an unconditional GAN (StyleGAN) that best represent x and can reconstruct the original face; for this task, they employed a previously trained encoder (pSp) [26]. Its output, a series of style vectors, makes it possible to reconstruct the original image when passed through the StyleGAN generator.

A second encoder, \(E_{age}\), is trained to capture the difference (residual) between the reconstructed image obtained by the first encoder (pSp) and the aged image. In this second encoder, a pre-trained DEX network [29] guides the training toward the desired age, and a pre-trained face recognition network, ArcFace [4], helps maintain the individual characteristics of the original image. These networks were kept frozen during the training of the \(E_{age}\) encoder.

The outputs of both encoders are summed and become the latent input vector that StyleGAN uses. Additionally, because it is an image-to-image translation, the model uses a cyclic loss to reconstruct the original image after a cycle consistency pass.

The cost functions used were:

  • Pixel-to-Pixel Similarity \(\mathcal {L}_2(x_{age}) = \Vert x-SAM(x_{age}) \Vert _2\)

  • Perceptual Similarity Loss: \(\mathcal {L}_{LPIPS}(x_{age}) = \Vert F(x)-F(SAM(x_{age})) \Vert _2\).

  • Regularization Loss: This regularization causes the style vectors to be close to the average of the latent vectors. The authors identified that its use improves image quality by removing unwanted artifacts in the images produced.

  • Identity Loss (\(\mathcal {L}_{ID}\)): difference in cosine similarity between the output and input images, weighted by the number of years between them; a large age gap is expected to entail some loss of identity.

  • Age Loss: To verify the quality of aging/rejuvenation in the generated image, a pre-trained DEX network was used, \(\mathcal {L}_{age} = \Vert \alpha _t - DEX(SAM(x_{age})) \Vert _2\).

3.2 Diffusion Models

Diffusion models aim to destroy the data distribution structure slowly and systematically in a way that enables learning a reverse diffusion process, which recreates the original structure of the data, generating a very flexible and computationally tractable generative model of the data [33].

In the work Denoising Diffusion Probabilistic Models (DDPM) [12], the authors achieved good results by predicting the noise \(\mathcal {N}(\mu ,\,\sigma ^{2})\) while keeping \(\sigma ^{2}\) fixed. Noise was added in the forward step using a linear schedule, and the reconstruction was done with a U-Net architecture containing attention blocks.

Later, some enhancements were proposed [22]: noise was added with a cosine schedule in the forward step instead of a linear one, as the authors noted that this destroys the image signal more slowly and improves learning. The network that reconstructs the data also began to learn the parameters of \(\sigma ^{2}\); layers were made deeper while reduced in number, and the number of attention layers and attention heads was increased, among other improvements.

In Denoising Diffusion Implicit Models, DDIM [34], the authors showed that a non-Markovian formulation of the forward step of DDPM models makes it possible to use only a (progressive) subset of the steps t in a sampling trajectory. This considerably accelerates sampling: instead of the 1,000 noise-removal steps used in DDPM models, good results can be obtained with 200 or fewer. Equation 2 shows how noise is added to the original image \(x_0\) to obtain the image \(x_t\) at step t, with Gaussian noise \(\epsilon \) and noise-schedule coefficient \(\alpha _t\). Equation 3 shows the inverse, predicting the noise to be removed from \(x_t\) towards \(x_0\), where \(\epsilon _t \sim \mathcal {N}({\textbf {0}}, {\boldsymbol{I}})\).

$$\begin{aligned} x_t = \sqrt{\alpha _t}x_0 + \sqrt{1-\alpha _t}\epsilon \end{aligned}$$
(2)
$$\begin{aligned} {\textbf {x}}_{t-1} = \sqrt{\alpha _{t-1}} \underbrace{ \left( \frac{{\textbf {x}}_t - \sqrt{1-\alpha _t}\,\epsilon _{\theta }^{(t)}({\textbf {x}}_t)}{\sqrt{\alpha _t}} \right) }_{\text {``predicted}\,\, x_{0} \text {''}} + \underbrace{ \sqrt{1-\alpha _{t-1}-\sigma ^{2}_t} \cdot \epsilon _{\theta }^{(t)}({\textbf {x}}_t) }_{\text {``direction pointing to}\,\, x_{t} \text {''}} + \sigma _t\epsilon _t \end{aligned}$$
(3)
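Equations 2 and 3 can be checked on a toy one-dimensional "image". The noise predictor below is a stand-in that returns the true noise, so the deterministic DDIM step (\(\sigma _t = 0\)) recovers the forward relation exactly; the alpha values are an arbitrary schedule, not a tuned one.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)                     # toy "original image"
alpha_t, alpha_prev, sigma_t = 0.5, 0.8, 0.0    # sigma_t = 0: deterministic DDIM

eps = rng.standard_normal(8)
# Eq. 2: forward noising of x0 at step t.
x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1 - alpha_t) * eps

def eps_theta(x, t):
    # Stand-in for the trained noise predictor; here it returns the true
    # noise so the reverse step is exact.
    return eps

# Eq. 3: "predicted x0" term, then the direction pointing to x_{t-1}.
x0_pred = (x_t - np.sqrt(1 - alpha_t) * eps_theta(x_t, 0)) / np.sqrt(alpha_t)
x_prev = (np.sqrt(alpha_prev) * x0_pred
          + np.sqrt(1 - alpha_prev - sigma_t**2) * eps_theta(x_t, 0))
```

With an exact noise prediction, `x_prev` equals the Eq. 2 noising of `x0` at level `alpha_prev`, which is what lets DDIM skip intermediate steps.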

Latent Diffusion Models.

Latent diffusion models, LDM [27], gained notoriety for generating realistic text-guided images. With the release of Stable Diffusion, a pre-trained latent diffusion model trained on high-resolution images from the LAION database [32], the approach quickly gained prominence due to its open-source code and high-quality text-to-image generation. One of the work's main innovations was to apply a dimension-reduction technique, a variational autoencoder, and perform the diffusion process in this reduced domain. This made the use of self-attention modules affordable, since their complexity grows quadratically with the input size. Additionally, the energy consumed and the processing time during model training decreased considerably compared to previously proposed architectures.

The model was divided into two phases:

  1. In the compression phase, a variational autoencoder learns the perceptual domain of the images. Thus, the diffusion model does not add and remove noise on the raw input image, usually of high dimension, but on a latent vector (the intermediate layer of the autoencoder), considerably reducing the computational complexity, as each spatial dimension of the input is reduced by a factor of eight. This variational autoencoder uses a perceptual loss and a patch-based adversarial objective, ensuring consistency in each image patch.

  2. In the generative learning phase, a U-Net [28] structure is used with cross-attention mechanisms conditioned on different input modalities (text and image) via a domain-specific encoder.

One technique used is classifier-free guidance [13], in which two text representations, an empty prompt and the input sentence, are processed together, along with two copies of the latent vector to be denoised. The output therefore has two components: one conditioned on the input text, \(\hat{x}_{text\_cond}\), and an unconditional one, \(\hat{x}_{incond}\). A model parameter \(h_1\) controls how much the text-conditioned component affects the final image, as seen in Eq. 4: larger values of \(h_1\) force the generated image to be more faithful to the input text, while smaller values give more freedom in the final image.

$$\begin{aligned} \hat{x} = \hat{x}_{incond} + h_1*(\hat{x}_{text\_cond} - \hat{x}_{incond}) \end{aligned}$$
(4)
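Equation 4 is a simple linear extrapolation between the two components, sketched below on toy denoised outputs (the vectors stand in for the model's unconditional and text-conditioned predictions).

```python
import numpy as np

# Eq. 4: interpolate/extrapolate between the unconditional and the
# text-conditioned predictions with guidance scale h1.
def classifier_free_guidance(x_incond, x_text_cond, h1):
    return x_incond + h1 * (x_text_cond - x_incond)

x_incond = np.array([0.0, 0.0])      # stand-in unconditional prediction
x_text_cond = np.array([1.0, -1.0])  # stand-in text-conditioned prediction

# h1 = 1 reproduces the conditioned prediction; larger values push the
# output further along the direction given by the text.
out = classifier_free_guidance(x_incond, x_text_cond, h1=7.5)
```

With `h1 = 0` the text is ignored entirely, and with `h1 > 1` the output extrapolates beyond the conditioned prediction, which is the usual operating regime.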

Image Editing Using Diffusion Models. There are a few ways to perform image editing with diffusion models. Three of them are:

  • Methods that use DDIM inversion to find the latent vector that best reconstructs the original image, and then perform the editing during the denoising steps. These methods do not modify the parameters of the diffusion models used; Pix2pix-zero [24] adopts this approach.

  • Fine-tuning the weights of pre-trained diffusion models to fit the examples to be edited, as in DreamBooth [30] and Textual Inversion [6].

  • Training a diffusion model to perform image editing, as in Production-Ready Face Re-Aging for Visual Effects [40] and Instruct-pix2pix [2].

DDIM Inversion. The DDIM inversion technique [34] is a way of editing real images. It consists of finding the latent variable \(x_T\) which, when traversing the deterministic sampling path, produces a realistic approximation of the original image \(x_0\). Once this is done, it becomes possible to edit just parts of the encoding that conditions the image generation, for instance replacing words while maintaining the characteristics of the original image. Another editing approach uses a mask to restrict changes to a region of interest.

CLIP. In Contrastive Language-Image Pre-Training, CLIP [25], a neural network was trained on 400 million image-caption pairs collected from the internet.

This network was proposed to be used without retraining for specific tasks, since it aligns in the same dimensional space both the image representation (after passing through an image encoder), \(I_n\), and the text representation (after tokenization, encoding, and padding), \(T_n\). Because both have the same dimension, the matrix of pairwise dot products \(I_n \cdot T_n\) can be computed; the objective function maximizes the similarity of matching pairs and minimizes the similarity of unrelated pairs. The representation obtained by the network has been used in multiple image-related tasks, since it aligns images with textual context.
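The alignment described above can be sketched as a matrix of pairwise similarities between L2-normalized embeddings; the random vectors below stand in for real CLIP image and text encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 8  # toy batch of 4 image-caption pairs, 8-D embeddings

# L2-normalize stand-in image and text embeddings.
I = rng.standard_normal((n, d)); I /= np.linalg.norm(I, axis=1, keepdims=True)
T = rng.standard_normal((n, d)); T /= np.linalg.norm(T, axis=1, keepdims=True)

# Entry [i, j] is the cosine similarity of image i with caption j.
logits = I @ T.T
# Training would maximize the diagonal (matching pairs) and minimize the
# off-diagonal entries, e.g. with a symmetric cross-entropy over rows and
# columns of this matrix.
```

After training, the diagonal of `logits` would dominate, which is what makes zero-shot matching of images to text possible.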

Pix2pix-Zero. In Zero-shot Image-to-Image Translation [24], the authors propose a method for editing real images that preserves the characteristics of the original images, following these steps:

  1. Perform a DDIM inversion to obtain the latent vector that best represents the real image in the model used.

  2. Find an editing direction: the authors used the GPT-3 text generation model [3] to generate sentences with a source term (e.g. dog) and a target term (e.g. cat). These sentences pass through a CLIP model to obtain representations in that domain, and the difference between the averages of these representations is, in theory, the editing direction for the images.

  3. Obtain a caption (the text most adherent to the image), so that the editing happens at the word that best represents the changed term. As the cross-attention modules generate masks relating words to the image, using this information during editing preserves what is not being changed. The authors used the BLIP model [20] to generate the caption.

  4. Perform the editing through the cross-attention modules. To do so, they first reconstructed the image without applying any editing, using only the input text, to obtain the cross-attention maps for each step t. They then added the editing direction and computed the loss gradient with respect to the input \(x_t\), which focuses the edit on the region represented by the edited word.
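The editing direction of step 2 amounts to a difference of mean embeddings, which can be sketched as follows (random vectors stand in for real CLIP outputs of the generated sentences).

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-ins for CLIP embeddings of sentences about the source and target
# terms (e.g. "young face" vs. "old face"); 512 is a typical CLIP width.
emb_source = rng.standard_normal((50, 512))
emb_target = rng.standard_normal((50, 512))

# Edit direction: difference of the mean embeddings, normalized to unit
# length so only the direction matters.
direction = emb_target.mean(axis=0) - emb_source.mean(axis=0)
direction /= np.linalg.norm(direction)
```

Averaging over many sentences is meant to cancel out sentence-specific content and leave only the source-to-target semantic shift; as noted later in the experiments, a single average direction applied to every image is also a limitation of this scheme.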

Instruct-Pix2pix. Instruct-pix2pix [2] presents a method to train a model that follows human editing instructions on images. The model receives an input image and a text instruction and performs the edit in a single forward pass.

This is possible because the authors used the GPT-3 model [3] to generate editing instructions and captions for the original and edited images. They then used the pre-trained Stable Diffusion network [27] to generate pairs of images corresponding to the created captions, producing a dataset of more than 450,000 examples. With these examples, a new diffusion model was trained to generate an edited image given an input image and an editing instruction.

The authors highlighted that text-to-image diffusion models (such as Stable Diffusion) can generate drastically different images for slightly modified texts. To mitigate this problem, they used the prompt-to-prompt technique [10], in which the weights of the cross-attention modules relate words to regions of the image, so editing is restricted to the regions related to the words being edited. Furthermore, this technique has a parameter \(\rho \) that controls the similarity between the two images. To set it automatically, 100 examples with \(\rho \sim \mathcal {U}(0.1, 0.9)\) were sampled and filtered based on a distance metric in the CLIP representation space.

Another important point is that the authors used two parameters in the classifier-free guidance: one that controls how much the output corresponds to the input image, \(c_{image}\), and another that controls how strongly the instruction is followed, \(c_{instruction}\).
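A hedged sketch of how two guidance scales might combine is shown below, extending the single-scale classifier-free guidance to image and instruction conditioning; the exact formulation in the Instruct-pix2pix paper may differ in detail, and the vectors are toy stand-ins for noise predictions.

```python
import numpy as np

# Two-scale guidance: extrapolate first toward the image-conditioned
# prediction (scale c_image), then toward the fully conditioned one
# (scale c_instruction).
def guided_eps(eps_none, eps_img, eps_img_txt, c_image, c_instruction):
    # eps_none: no conditioning; eps_img: image only; eps_img_txt: both.
    return (eps_none
            + c_image * (eps_img - eps_none)
            + c_instruction * (eps_img_txt - eps_img))

eps_none = np.zeros(3)
eps_img = np.array([1.0, 0.0, 0.0])
eps_img_txt = np.array([1.0, 1.0, 0.0])
out = guided_eps(eps_none, eps_img, eps_img_txt,
                 c_image=1.5, c_instruction=7.5)
```

Setting `c_instruction = 0` reproduces the behavior described in the experiments, where the image is not aged because the instruction is ignored.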

3.3 FFHQ Aging Image Database

Flickr-Faces-HQ (FFHQ) [16] is a database of high-quality human faces intended as a benchmark for GANs. It consists of 70,000 high-resolution (1024\(\,\times \,\)1024) images obtained from the Flickr platform, chosen for their permissive sharing licenses and pre-processed (cropped and aligned) using the dlib library [19]. The authors note that this image base contains much more variation in ethnicity and photo backgrounds than CelebA-HQ, and that it also contains several accessories, such as glasses, sunglasses, and hats.

In LIFE [23], the authors published, along with their results, additional metadata containing pose, age and gender predictions, and the segmentation of the face regions present in the FFHQ database, which they called FFHQ Aging. It is worth noting that the age ranges present in the FFHQ-Aging metadata are estimates obtained by the authors using the Appen platform, which, in addition to the age range, returns a confidence value for the age prediction. To reduce the error that incorrect estimates may introduce, only images with a confidence of 100% were chosen for the comparison presented in this work.

3.4 Metrics

Fréchet Inception Distance (FID).

FID [11] is currently one of the most used GAN evaluation methods. It uses a classifier pre-trained with the InceptionNet architecture, with the last (fully connected) layer removed, so that its output is a 2048-dimensional vector representing the attributes the network detects in the image. Statistics of these representations, for both real and synthetic (generated) images, are used in the FID calculation.

Low FID values mean that the two distributions are close, which is the desired outcome when comparing the distribution of generated images with that of real ones.
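A simplified sketch of the FID formula is shown below, assuming diagonal covariances so the matrix square root reduces to an elementwise one; the real metric fits full-covariance Gaussians to 2048-D InceptionNet features.

```python
import numpy as np

# Fréchet distance between two Gaussians with diagonal covariances:
# ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)), where the trace
# term simplifies elementwise for diagonal C_a, C_b.
def fid_diag(mu_a, var_a, mu_b, var_b):
    diff = np.sum((mu_a - mu_b) ** 2)
    trace = np.sum(var_a + var_b - 2.0 * np.sqrt(var_a * var_b))
    return diff + trace

mu = np.zeros(4)
var = np.ones(4)
fid_same = fid_diag(mu, var, mu, var)            # identical distributions
fid_shift = fid_diag(mu, var, mu + 1.0, var)     # mean shifted by 1 per dim
```

Identical distributions give a FID of zero, and any mean or variance mismatch increases it, matching the interpretation that lower FID means the generated and real distributions are closer.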

Cosine Similarity of Face Embeddings.

One way multiple articles quantify identity maintenance in the aged generated image is through the embeddings of a pre-trained face recognition network [1, 23, 35], since such a network is optimized to find the facial features that make each individual unique.

Therefore, calculating the cosine similarity between the embedding of the original image and that of its aged version quantifies how close they are in a representation space that captures identity characteristics.
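A minimal sketch of this measurement follows, with random vectors standing in for FaceNet embeddings of the original and aged images.

```python
import numpy as np

# Cosine similarity between two embedding vectors.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
e_orig = rng.standard_normal(128)                  # stand-in original embedding
e_aged = e_orig + 0.1 * rng.standard_normal(128)   # a mild edit stays nearby

sim = cosine_similarity(e_orig, e_aged)
```

A similarity near 1 indicates the aged face is still recognized as the same person, while values drifting toward 0 signal identity loss.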

4 Experiments

These experiments aim to compare the performance of conditioned diffusion models with GANs trained specifically for the face aging task, verifying whether the aged image outputs would be of equivalent or superior quality.

The image base used was FFHQ Aging, in which the photos have an estimated age within predefined groups. In the experiments, we selected age ranges close to those used by the authors in their original works.

In each group, 50 images of unique individuals were sampled for both women and men. An interval of 20 years was used between the age ranges, except for the last one, which has a difference of 60 years, to examine an extreme case.

In this work, we measured the mean predicted age per group and the mean absolute error (MAE), comparing the predicted age with the one estimated by the DEX model in the initial image.

Cosine similarity was also calculated between the representations obtained with the pre-trained FaceNet network [31]. FaceNet performs similarly to ArcFace [4], but since SAM uses ArcFace in its cost function, the comparison is fairer when done with another pre-trained network.

Fig. 1. Examples in which the original-image reconstruction step using the DDIM inversion technique did not obtain good results, which can compromise the quality of the generated aged images.

Table 1. Comparison of the models on the aged images.

In the experiments carried out with the Instruct-pix2pix model varying \(c_{instruction}\), which controls how strongly the aging instruction is followed, we observed that with a value of 0 the image is not aged, as the instruction is not followed, which is expected. Increasing \(c_{instruction}\) to 10 drops the age error considerably; however, the images are no longer close in the FaceNet representation. Given these results, the \(c_{instruction}\) parameter of the Instruct-pix2pix model was set to 3.

It is important to emphasize that the results were initially analyzed separating the input images between men and women; however, as no significant differences were observed, the data were unified.

In the results presented in Table 1, SAM obtains the lowest age-error averages, possibly because it has an age estimator in its cost function. However, the similarity between the aged image and the original is smaller than in the results of Instruct-pix2pix. The high similarity of the images produced by HRFAE can be explained by the minor changes performed by the network: the images are not properly aged, as reflected in the high age error despite the low FID values.

Fig. 2. Comparison of results on the facial aging task for each model. The first column, Original, shows the input image with its initially estimated age. The four images to its right are the outputs of the models attempting to age the image to the expected age. For example, 10–30 is the task of aging a face of approximately 10 years to 30 years.

Fig. 3. Comparison of results on the facial aging task for each model. The first column, Original, shows the input image with its estimated initial age, and the four images on the right are the outputs of the models attempting to produce the aged image at the expected age.

In the Pix2pix-zero model, two factors can affect the preservation of individual characteristics. One is that the model uses an editing direction obtained as a single average direction, which can cause it to age all images in the same unconditional way, as reflected in the low similarity and high FID values. The other is that it needs to reconstruct the original image using the DDIM inversion technique; a poor reconstruction can compromise the final aged image. Figure 1 shows examples of when this occurs.

Observing Figs. 2 and 3, the SAM model is the only one that achieves reasonable performance when aging from approximately ten years old to older ages. Since this is a growth phase, the face undergoes many structural changes, which the other models failed to capture. The Pix2pix-zero model produced blurry results on some images, which is expected when using an average edit direction to move from one domain to another. Instruct-pix2pix maintained the identity of the individuals and achieved considerable aging; however, in all cases the skin texture was degraded, which is common over the years (especially with heavy sun exposure or smoking) but is not guaranteed to happen in every aging process. The SAM model seems to show more variability in how it ages faces.

5 Conclusion

This study analyzed the performance of four models on the facial aging task: two models that use GANs in their architecture and two conditional diffusion models that enable image editing. Two models stood out in this analysis. SAM is based on a pre-trained GAN (StyleGAN2) and specialized in facial aging and rejuvenation, with components of its cost function dedicated to preserving identity and achieving the age change. The second highlight was Instruct-pix2pix, a generic conditional diffusion model that receives only an instruction in the form of text and can edit images in different ways in a zero-shot fashion. In this study, it received only an instruction to age the face to a desired age and still produced realistic aged images.

Currently, specialist GANs still demonstrate superior performance compared to generic diffusion models such as those used in the experiments. Even so, the results demonstrate the great capacity of diffusion models, since Instruct-pix2pix obtained considerable results even though it was not trained specifically for this task. Furthermore, diffusion models have gained notoriety recently, with many works presenting new techniques, so rapid advances in their architectures are expected.