1 Introduction

No society, however low its level of material accomplishment may have been, has ceased to produce art [15]. Art has been, through the ages, a mirror reflecting society; the purpose of art has varied but its importance in capturing society has been constant. One way to discuss the content of artwork is through style categories. Such categories indicate connections between paintings and help us to better understand their meanings. Even if many art theorists look at style categories with reservations, categories provide an important guideline for beginner art aficionados and the general crowd within museums and galleries. Besides, style categories can be used by recommendation tools to improve a museum visitor’s experience, either in a traditional physical museum location or in one of the increasingly common digital galleries.

Alas, the current quantity of online art has now surpassed our manual indexing abilities. The field of computer vision has developed artwork style classification tools using a diversity of techniques from feature-based machine learning [2, 32] to deep learning [3, 10, 23]. Beyond the challenge of class imbalance, style classification is further complicated by aesthetic diversity, a problem that is not usually met in most image classification models. For instance, the Baroque is directly linked to the historic moment of the Catholic Church struggling against the Protestant Reformation; in an attempt to combat the Protestant faith, the Church announced during the Council of Trent (1545–1563) the dissemination of religious ideas through images to awaken religious fervor—therefore, the Baroque works are very concentrated on the religious theme. Minimalism, on the other hand, does not have a single theme, but it employs a few elements in a distinctive manner [15].

The use of image augmentation is well established in computer vision so as to enhance the content of a training dataset. But usual techniques do not produce good results when applied to artwork; for instance, many of the common image augmentation techniques modify the resulting color of the image, which can be problematic in the case of artwork as colors often have or convey particular meaning. In this paper we propose an artwork style classifier that resorts to Generative Adversarial Networks (GANs) as an image augmentation tool for oversampling.

Artwork generation with GANs is not straightforward; even high performance GAN architectures did not do well in preliminary experiments, so we had to propose techniques to enhance image quality and diversity. We describe in this paper a methodology for data augmentation that is tuned to the requirements of artwork classification; we also describe a specialized network architecture and an adjusted loss function that both contribute to the overall performance of our proposal. In short, our contributions are both in adopting GANs for data augmentation in the particular context of artwork classification, and in proposing novel techniques that may be of broad interest whenever similar classification problems are met.

The paper is organized as follows. Section 2 analyzes related work on art style classification and Generative Adversarial Networks. Section 3 describes our approach; results and analysis are shown in Sect. 4. Finally, conclusions are presented in Sect. 5.

2 Related Work

In this section, we explore relevant efforts on artwork classification. We also present the architectural development of Generative Adversarial Networks (GANs), its use in the art domain and its relevance to image augmentation.

2.1 Artwork Classification

Initial research on artwork classification was focused on feature-based machine learning [2, 32]. Since the initial success of a convolutional neural network (CNN) in object classification for ImageNet [25], several studies in artwork classification have explored these networks [3, 10, 16, 23]. To improve the performance of the classification of art styles with CNNs, researchers have investigated the benefits of pre-trained weights from the ImageNet Challenge instead of random startups [26, 36].

There has been significant activity in artwork classification in recent years. Cetinic et al. expanded artwork classification beyond style, genre, artist and time period classification, examining also nationality and testing scene recognition and sentiment analysis techniques [6]. Chu et al. studied the style classification task focusing on describing image texture with deep learning; they investigated intra-layer and inter-layer correlations in order to create deep features for style classification [9]. The work of Elgammal et al. analyzed the learned representations of a fine-tuned ResNet-152, noting that some of the style patterns proposed by Heinrich Wölfflin (1846–1945) correlate with the PCA decomposition of these learned representations [14]. In Rodriguez et al., five image patches of each painting were used for training and weights for each patch were optimized in order to improve the accuracy of the final model [29]. Sandoval et al. also worked with image patches, but in a two-stage deep learning approach in which the five patches are trained independently in a first stage; in the second stage, the outcomes for these patches are fused by a shallow neural network for the final decision [31]. Zhong et al. presented a two-channel dual path network with two inputs: the RGB image and a four-directional gray-level co-occurrence matrix for detecting brush stroke information [44]. The work of Zhu et al. not only trained the Inception V3 network for classifying nine artistic movements, but also used Grad-CAM heat maps for visualizing the areas of the images the model was focusing on for class prediction [46]. Chen et al. presented an adaptive cross-layer correlation for artwork classification, which adaptively weights features in different spatial locations based on similarity [7]. Bianco et al. studied the advantages of training a model with the full image of the paintings and also its crops at different resolutions [4].

2.2 Generative Adversarial Networks

Generative Adversarial Networks (GANs) are built by training two networks with a minimax game framework, where we have a model capable of generating synthetic data (model G) and another model that evaluates whether the data is real or synthetic (model D) [19]. In the work of Mirza and Osindero, an extra parameter was introduced in the GAN architecture so as to allow the generator model to create images according to class labels [27]. Chen et al. developed a GAN architecture to learn disentangled representations in an unsupervised manner; they introduced a representation learning algorithm called Information Maximizing Generative Adversarial Networks (InfoGAN), in which an information-regularized minimax game is used in order to train a multi class generative model without the label information [8].

Some work on GANs has focused on the impact of the loss function. Arjovsky et al. used the concept of the Earth Mover (EM) distance, also known as Wasserstein-1 distance, as a loss function [1]. Gulrajani et al. improved the Wasserstein loss function by adding a gradient penalty [20].

An important concept in the evolution of the GAN architecture is the self-attention mechanism. The convolutional architecture processes information in local neighborhoods and it has no mechanism to deal with long distance dependencies. The self-attention enables both the generator and the discriminator to deal with widely separated spatial regions. The work of Zhang et al. presented this concept as Self-Attention GAN (SAGAN) [43] and other proposals have used this architecture as a reference [5, 11].

The use of synthetic images to improve classifiers has been adopted particularly in the medical field, for instance in simulating lung nodules [21], ECG [42], liver lesions [17], chromosomes [40], skin lesions [28] and Covid-19 results [12, 38]. We also mention the work of Suh et al., where classifier loss is included in the GAN training process in order to reduce ambiguity between classes: their classification enhancement generative adversarial networks (CEGAN) consist of three independent networks – a discriminator, a generator and a classifier – using WGAN-GP for classification under imbalanced data conditions [34].

Generative adversarial networks have already been used in the art domain. For instance, one can find a generator for image style transfer (the CycleGAN [45]), an Image-to-Image translator from art to real images [18, 37], and a model specialized in creating Chinese landscapes [41]. The first, and to the best of our knowledge the only, work aimed at creating artwork with the WikiArt dataset appeared in Ref. [13]. There, Elgammal et al. developed the Creative Adversarial Network to creatively generate artwork by maximizing deviation from established styles and by minimizing deviation from art distribution [13]. The authors found that the way to encourage the generator “to be creative” was to penalize it whenever it was too easy for the Discriminator to identify the synthetic image as being art from a certain style.

3 Our Proposal

Simply put, our main goal is to enhance the accuracy of artwork style classification. In pursuing this, we were led to study the potential benefits of synthetic images generated by Generative Adversarial Networks. As noted before, synthetic art generation has been employed only by Elgammal et al. (2017), but not with the purpose of enhancing classification; to the best of our knowledge, the latter task has not been investigated yet. In short, our specific goal here is not to make art, but to improve art style classification.

We introduce a strategy that we refer to as Class-by-Class Performance Analysis. In order to start up learning, a baseline model is trained without image augmentation techniques. The next step is to explore the benefit of geometric transformations. We then concentrate on classes with the lowest performance; we want to maximize the information about these classes. Such information is enhanced by a version of GANs with a self-attention mechanism for image diversity that is trained with the Wasserstein with Gradient Penalty loss function for avoiding vanishing gradient.
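The selection step of the Class-by-Class Performance Analysis can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `per_class_f1` and `weakest_classes` are hypothetical, and the labels in the usage lines are made-up toy data rather than results from the paper.

```python
def per_class_f1(y_true, y_pred):
    """Return a dict mapping each class label to its f1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # f1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def weakest_classes(scores, k):
    """Return the k classes with the lowest f1-score (candidates for GAN oversampling)."""
    return sorted(scores, key=scores.get)[:k]


# Toy labels for illustration only (not data from the experiments).
y_true = ["Pop Art", "Pop Art", "Ukiyo-e", "Ukiyo-e", "Romanticism", "Romanticism"]
y_pred = ["Romanticism", "Pop Art", "Ukiyo-e", "Ukiyo-e", "Romanticism", "Pop Art"]
scores = per_class_f1(y_true, y_pred)
targets = weakest_classes(scores, k=2)
```

Ranking by f1-score rather than by class size reflects the point made in the text: the classes to oversample are chosen by performance, not by image count.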

In addition, classes with low performance may or may not contain a small number of images. Hence, we suggest two strategies when we sample either:

  • low quantity classes: add a multiple of the number of original images;

  • high quantity classes: add a fraction of the number of original images.
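The two sampling rules above amount to a single multiplication with different kinds of ratios. The sketch below assumes truncation to an integer and leaves the decision of which classes count as low or high quantity to the caller; the function name and the example counts are hypothetical.

```python
from fractions import Fraction


def synthetic_images_to_add(n_original, ratio):
    """Number of GAN-generated images to add to a class.

    ratio is an integer multiple (1, 2, ...) for low quantity classes
    and a fraction below one (e.g. Fraction(1, 4)) for high quantity classes.
    The result is truncated to an integer (an assumption of this sketch).
    """
    return int(n_original * Fraction(ratio))


# Hypothetical counts, for illustration only.
low_quantity_extra = synthetic_images_to_add(1205, 2)
high_quantity_extra = synthetic_images_to_add(10000, Fraction(1, 4))
```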

There are many decisions that must be made in implementing this strategy; some of them depend on the particular artwork collection one is dealing with. In the following subsections, all relevant aspects of our strategy are discussed: our study of image augmentation, our choices regarding the GAN architecture and the loss function and, finally, the model architecture for our classification task.

3.1 Image Augmentation

Data augmentation is a natural solution to the problem of limited data [33]. With respect to images, augmentation techniques consist of geometric transformations, such as rotation, image cropping and flipping, and of color conversions [34]. We studied these techniques in the context of art style classification, a setting with many classes and serious class imbalance.

Our first experiment was to ascertain the benefits of rotation, image cropping and flipping. Results can be seen in Table 6. Improvement was obtained by rotating the image between -10 and 10 degrees, applying horizontal flips and taking random crops. Alas, traditional image augmentation techniques often generate simple and redundant copies of the original data [34], so in some cases data augmentation may not be a profitable idea. For many domains, color conversions offer image diversity; for instance, the object class “bicycle” can be represented with red or blue bicycles. In artwork, however, colors are meaningful and altering them without care can result in an image that lies in a category different from the original one. For example, Pablo Picasso’s Blue Period paintings (1901–1904) should always be represented by gloomy shades of grayish blue. For Édouard Manet, the true color of the atmosphere was violet, and many of his Impressionist colleagues used violet as their main color [24].
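The geometric transformations used in the first experiment (rotation between -10 and 10 degrees, horizontal flip, random crop) can be sketched with NumPy as below. Only the rotation range comes from the text; the nearest-neighbor rotation, the flip probability of 0.5, the crop size and the function names are assumptions of this sketch. Colors are deliberately left untouched.

```python
import numpy as np


def rotate_nearest(img, angle_deg):
    """Rotate an (H, W, C) image about its center with nearest-neighbor sampling.

    Pixels that fall outside the source image are filled with zeros.
    The sign convention of the angle is arbitrary in this sketch.
    """
    h, w = img.shape[:2]
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, find the source pixel.
    x_src = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    y_src = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    xi = np.rint(x_src).astype(int)
    yi = np.rint(y_src).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(img)
    out[valid] = img[yi[valid], xi[valid]]
    return out


def augment(img, rng, crop_size, max_angle=10.0, flip_prob=0.5):
    """Apply a random rotation, an optional horizontal flip and a random crop."""
    out = rotate_nearest(img, rng.uniform(-max_angle, max_angle))
    if rng.random() < flip_prob:
        out = out[:, ::-1]  # horizontal flip
    ch, cw = crop_size
    top = rng.integers(0, out.shape[0] - ch + 1)
    left = rng.integers(0, out.shape[1] - cw + 1)
    return out[top:top + ch, left:left + cw]
```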

3.2 Generative Adversarial Network

One of the most challenging tasks for Generative Adversarial Networks is producing a diverse set of synthetic images. Wang et al. compare the most influential GAN architectures with respect to image quality, performance against vanishing gradient and ability for mode diversity. In order to create synthetic artwork, where shape must greatly vary and colors are very relevant, we chose to focus on a mechanism proven to excel in diversity; we adopted Self-Attention Generative Adversarial Network (SAGAN) with some modifications which will be explained later. We chose this architecture as the basis of our GAN as the self-attention mechanism helps the GAN to learn global and long-range dependencies across multi-class images [39].

Fig. 1. The self-attention mechanism [43]

Figure 1 shows a self-attention module. Linear transformations (implemented as \(1 \times 1\) convolutions in SAGAN [43]) are used to create the key f(x), query g(x) and value h(x):

$$ f(x) = W_{f}x, \qquad g(x) = W_{g}x, \qquad h(x) = W_{h}x. $$

The attention map is created by applying a softmax to the dot product of the key and the query (Eq. 1). The attention map is then used to weight the values, and the weighted sum is projected by \(W_{v}\) to create the self-attention output (\(o_{j}\)) (Eq. 2).

$$\begin{aligned} \alpha _{j,i} = \mathrm {softmax}(f(x_{i})^{\top } g(x_{j})), \end{aligned} \tag{1}$$
$$\begin{aligned} o_{j} = W_{v}\left( \sum _{i=1}^{N}\alpha _{j,i}h(x_{i})\right) . \end{aligned} \tag{2}$$

The parameters \(W_{f}\), \(W_{g}\), \(W_{h}\) and \(W_{v}\) are the learned weight matrices.
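As an illustration of Eqs. 1 and 2, the following NumPy sketch computes the attention map and the self-attention output for a feature map flattened to N spatial locations. The channel sizes, the random weights and the function names are assumptions made for the example; in a trained network the four matrices are learned.

```python
import numpy as np


def softmax(s, axis):
    """Numerically stable softmax along the given axis."""
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, W_f, W_g, W_h, W_v):
    """x has shape (C, N): C channels at N spatial locations.

    Returns the output o of shape (C, N) and the attention map alpha,
    stored so that alpha[i, j] corresponds to alpha_{j,i} in Eq. 1.
    """
    f = W_f @ x  # keys,    shape (C_k, N)
    g = W_g @ x  # queries, shape (C_k, N)
    h = W_h @ x  # values,  shape (C_k, N)
    s = f.T @ g  # s[i, j] = f(x_i)^T g(x_j)
    alpha = softmax(s, axis=0)  # normalize over i for each output location j
    o = W_v @ (h @ alpha)  # o[:, j] = W_v * sum_i alpha_{j,i} h(x_i)
    return o, alpha


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, C_k, N = 8, 2, 16
x = rng.normal(size=(C, N))
W_f, W_g, W_h = (rng.normal(size=(C_k, C)) for _ in range(3))
W_v = rng.normal(size=(C, C_k))
o, alpha = self_attention(x, W_f, W_g, W_h, W_v)
```

Because every output location j attends to all N input locations, the cost of the attention map grows as N squared, which is one reason such modules are usually applied to intermediate feature maps rather than to full-resolution images.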

3.3 Adversarial Loss Function

The original version of the SAGAN training configuration used hinge loss. For our dataset, we experienced frequent mode collapse and vanishing gradient with this loss, likely due to our dataset’s diversity. We therefore adopted the Wasserstein loss, which requires the discriminator to be k-Lipschitz; to enforce this constraint, as needed to prevent mode collapse, a gradient penalty was added to the loss function. The resulting Wasserstein with Gradient Penalty (Wasserstein-GP) loss function gives better control of the values used for feedback. That is, the loss is:

$$\begin{aligned} L = \mathop {\mathbb {E}}_{\tilde{x}\sim \mathbb {P}_{g}} \left[ D(\tilde{x}) \right] - \mathop {\mathbb {E}}_{x\sim \mathbb {P}_{r}} \left[ D(x) \right] + \lambda \mathop {\mathbb {E}}_{\hat{x}\sim \mathbb {P}_{\hat{x}}} \left[ (||\nabla _{\hat{x}} D(\hat{x})||_{2} - 1)^2\right] , \end{aligned} \tag{3}$$

where \(\mathbb {P}_{r}\) is the distribution of real images, \(\mathbb {P}_{g}\) is the distribution of generated images and \(\mathbb {P}_{\hat{x}}\) is the distribution obtained by sampling uniformly along straight lines between pairs of points sampled from the data distribution \(\mathbb {P}_{r}\) and the generator distribution \(\mathbb {P}_{g}\). In our experiments, the penalty was computed on interpolations between a batch of real images and a batch of generated images, and it pushes the gradient norm of the discriminator output at these points toward 1. The value of the penalty coefficient (\(\lambda \)) was 10, following the original paper [20].
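To make the three terms of Eq. 3 concrete, the sketch below evaluates the loss for a toy discriminator \(D(x) = \tanh (w^{\top }x)\), chosen only because its gradient with respect to the input is available in closed form. This is an assumption made for illustration: the discriminator in the paper is a SAGAN network, and a real implementation would obtain the gradient by automatic differentiation.

```python
import numpy as np


def critic(x, w):
    """Toy discriminator D(x) = tanh(w . x) for a batch x of shape (B, D)."""
    return np.tanh(x @ w)


def critic_grad(x, w):
    """Closed-form gradient of D with respect to x, shape (B, D)."""
    d = critic(x, w)
    return (1.0 - d ** 2)[:, None] * w[None, :]


def wgan_gp_loss(real, fake, w, rng, lam=10.0):
    """Discriminator loss of Eq. 3 on one batch of real and generated samples."""
    # Sample x_hat uniformly on the segment between each real/fake pair.
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake
    grad_norm = np.linalg.norm(critic_grad(x_hat, w), axis=1)
    penalty = np.mean((grad_norm - 1.0) ** 2)
    return critic(fake, w).mean() - critic(real, w).mean() + lam * penalty
```

The penalty is two-sided: it grows both when the gradient norm exceeds 1 and when it falls below 1, which is the form used in Eq. 3.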

3.4 EfficientNet

The EfficientNet B0 architecture was chosen for these experiments due to its high performance and relatively small size. EfficientNet B0 belongs to a family of models known as EfficientNet, presented by Tan and Le [35]. The EfficientNet B0 is the EfficientNet family’s baseline, from which all the models from B1 to B7 are scaled up uniformly in dimensions of depth, width and resolution. Table 1 summarizes the stages in the EfficientNet B0. Each stage consists of one or more layers built with blocks of MBConv [30], which are combined with a Squeeze-and-Excitation optimization [22].

Table 1. Summary of the composition of the EfficientNet B0

4 Experimental Results

The methods described previously are only useful if they do lead to improvement in realistic circumstances. We developed and tested them by dealing with real artwork and existing datasets. In this section, we present the dataset we used and its characteristics, the training configurations for both GAN and classifier and the results for each step of the Class-by-Class Performance Analysis.

4.1 The Wikiart Dataset

We used the Wikiart dataset in our experiments. In fact, the version we used was the one discussed by Elgammal et al. [14], from which we adopted the following conventions:

  • New Realism and Contemporary Realism were added to Realism;

  • Action Painting was added to Abstract-Expressionism;

  • Synthetic Cubism and Analytical Cubism were added to Cubism.

Table 2. Dataset used in experiments.

A total of 63,659 images are available there; 10% of them were used for testing and 10% of the remaining dataset was used for validation. The number of training images per class is presented in the last column of Table 2. The training dataset is used both in classifier training and in GAN training. The test dataset is the same for all experiments to ensure comparability of results. The class imbalance is apparent, with the number of images per class varying between 940 and 10,566.
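The split arithmetic described above can be reproduced as follows. The rounding rule is an assumption, and the sketch ignores any per-class stratification, so the exact per-class counts remain those of Table 2.

```python
def split_sizes(total, test_frac=0.10, val_frac=0.10):
    """Return (n_train, n_val, n_test).

    The test set is a fraction of the whole dataset; the validation set
    is a fraction of what remains after removing the test set.
    """
    n_test = round(total * test_frac)
    n_val = round((total - n_test) * val_frac)
    n_train = total - n_test - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(63659)
```

Under this scheme the training set holds roughly 81% of the images (0.9 × 0.9), not 80%.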

4.2 GAN Training Configuration

In order to fit the training process on two GeForce GTX 1080 Ti GPUs (12 GB) and generate images of 128 \(\times \) 128, we lowered the batch size to 32 images (the original SAGAN was trained with a batch size of 256 images). The process ran for 200,000 epochs (approximately 34 h). Each class was trained independently, using the images of that class from the classifier’s training dataset. The optimizer setup follows the original SAGAN article: Adam optimizer with \(\beta _{1} = 0\) and \(\beta _{2} = 0.9\). The learning rate is constant but specific for each model: 0.0004 for the discriminator and 0.0001 for the generator [43].

4.3 EfficientNet B0 Training Configuration

Using the ImageNet pretrained model, the fine-tuning for our classification purpose was done by unfreezing each block until there was no further improvement on the validation dataset. The Stochastic Gradient Descent optimizer was used with decay 0.9 and momentum 0.9; initial learning rate of 0.01 with decay after the fifth epoch (\(lr = lr*e^{-0.1}\)). Images were resized to 224 \(\times \) 244 and batch size was 32 images.
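The learning rate schedule can be written as a small function. The sketch assumes 1-indexed epochs and assumes that the factor \(e^{-0.1}\) is applied once per epoch after the fifth; the text does not state either detail explicitly, and the function name is hypothetical.

```python
import math


def learning_rate(epoch, initial_lr=0.01, hold_epochs=5, decay=0.1):
    """Learning rate at a given 1-indexed epoch.

    The rate is constant for the first hold_epochs epochs and is then
    multiplied by exp(-decay) at every subsequent epoch.
    """
    steps = max(0, epoch - hold_epochs)
    return initial_lr * math.exp(-decay * steps)


schedule = [learning_rate(e) for e in range(1, 11)]
```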

4.4 Baseline Results

The results of the EfficientNet B0 baseline model and of the EfficientNet B0 model trained with geometric augmentation are shown in Table 6 (first and second rows). The performance for each class of the latter is shown in Fig. 2. It is important to highlight that low performance is not correlated with image quantity: the Ukiyo-e movement – the Japanese style – had the fewest images and the best f1-score. Still, the worst performance was obtained by the Pop Art class with only 1205 images. On the other hand, the second and third worst performances belong to two classes with the largest number of images – Expressionism and Romanticism.

Fig. 2. Analysis of the trained EfficientNet B0 with geometric augmentation

4.5 Sampling Low Quantity Classes

Pop Art is the fourth lowest class in terms of number of images, so the low quantity class sampling strategy was applied. The results are shown in Table 3. The classifier that obtained the best accuracy score is not the one that obtained the best f1-score for the Pop Art class. This suggests that other classes have also benefited from the information in the generated images (Tables 4 and 5).

Table 3. Summary of experiments with added synthetic Pop Art images

4.6 Sampling High Quantity Classes

Both the Expressionism and Romanticism classes behave similarly in our reference model: they have many more images than most of the classes – more than four times Pop Art’s image quantity – but low performance. It is interesting to observe that for both classes, the quantity of synthetic images that generated the best results was 1/4 of their original training data quantity. The synthetic Romantic class had a better f1-score when 3/4 of the quantity of the original data was added, but this still did not produce a better classifier.

Table 4. Summary of experiments with added synthetic Expressionist images
Table 5. Summary of experiments with added synthetic Romantic images

4.7 Summary of Results

Figure 3 shows how each class performed in the best experiment for each augmented class (Pop Art, Expressionism and Romanticism). Although the classification problem has too many classes for correlations to be pointed out with confidence, it is possible to observe that the Art Nouveau class had its best performance with synthetic images from the Pop Art artistic movement. It is also observed that the Pop Art class itself performs even better when we have the information of the synthetic images of the artistic movements Romanticism and Expressionism.

Fig. 3. The evolution of performance for each class during experiments.

Table 6 also shows the best result of each EfficientNet B0 model trained with synthetic artwork class. The strategy of Class-by-Class Performance Analysis allowed us to improve the classifier accuracy by almost 2%.

Table 6. EfficientNet B0 trained models.
Fig. 4. First row of images contains real Pop Art painting and second row contains generated Pop Art painting.

Fig. 5. First row of images contains real Expressionist painting and second row contains generated Expressionist painting.

Fig. 6. First row of images contains real Romantic painting and second row contains generated Romantic painting.

4.8 Generated Images

Figures 4, 5 and 6 show a sample of the images generated in our experiments. To help in understanding the quality of these generated images, real images of the corresponding artistic styles were added above. By visual inspection, it is noticeable that the generated images retain general properties of the styles. For the three depicted artistic styles, it is clear that the models are flawed in terms of shape definition, but they are competent in the choice of colors, even in the variety of colors that a style contains and in correctly separating grayscale images from colored images (examples in Figs. 4 and 5).

5 Conclusion

Artwork style classification is quite challenging: class imbalance and high diversity within classes and similarities between classes are inevitable given existing art styles. Besides, classification performance does not correlate with dataset size. It is thus natural to look for data augmentation strategies; however, straightforward techniques do not generate images that can help distinguish between class styles.

In this paper we explored GANs for data augmentation in this setting; this is a little explored avenue and one that cannot be taken without some care. To be able to actually use GANs to generate artwork-like images with the desirable properties and diversity, we had to develop a working methodology, a specialized network architecture, and an adapted loss function. These technical contributions should be valuable in other settings where artwork is to be classified; moreover, they should be useful in data augmentation scenarios with many classes that are hard to differentiate and that are sensitive to popular existing techniques such as coloring and rotation.

We found that our approach and methodology can generate better classifiers. In particular, the combination of self-attention for mode diversity and the Wasserstein-GP loss function against vanishing gradient enabled us, in our experiments, to improve the augmentation of the images and consequently increase the accuracy of the model by almost 2%.