1 Introduction

Understanding and analyzing geological characteristics and their influence on reservoir properties is crucial for the success of hydrocarbon exploration and production. Moreover, automating these processes can result in the production of deliverables that are useful in well planning optimization, reservoir modeling, and risk analysis (Zheng et al. 2019). The Pre-Salt region in Brazil has enormous exploration potential, with reserve estimates suggesting that it contains approximately 70 to 100 billion barrels of oil. The majority of production comes from the Santos Basin, which is located in the southeastern part of the country and accounts for over 70% of Brazilian production. Therefore, geological and geophysical studies, in conjunction with technological tools, have been employed to investigate optimal exploration conditions and identify the structural and sedimentary characteristics of these complex basins that contribute to their potential for hydrocarbon accumulation.

Gaining a comprehensive understanding of the spatial distribution of structural, faciological, and petrophysical properties can aid in optimizing the characterization, prediction, and recovery phases of reservoirs by enabling the construction of more reliable models. The petrophysical properties of interest include the porous medium, which comprises matrix systems, vugs, and fractures that define spaces for fluid accumulation in rock, as well as permeability, which determines the connectivity of pores and is critical for percolation and the recovery of subsurface fluids.

Faults and fractures are crucial features within the porous medium, as they play a vital role in engineering, geotechnical, and hydrogeological applications. They can exhibit dual behavior, serving either as pathways for fluid flow or as barriers, depending on factors such as intensity, connectivity, dissolution, or cementation, which can render them impermeable. Numerous oil, gas, geothermal, and water supply reservoirs are formed in fractured rock. Therefore, an essential aspect of comprehending and predicting fracture behavior involves identifying and locating those that demonstrate hydraulically significant behavior (Council 1996). In order to analyze the properties of data from sources such as the pre-salt, located in ultra-deep waters at depths below 5,000 m, and to identify fault regions, it is necessary to conduct seismic studies. This type of data offers insight into the structures present throughout the entire region of interest (Dondurur 2018), providing information about layer structures and their features. Upon obtaining this data, segmentation techniques are employed to extract the structures that will inform subsequent analyses.

Our aim is to incorporate the Transformer concept in the automated segmentation of seismic data. While convolutional networks have been used in the literature for this task, recent studies have highlighted the advantages of Transformers in computer vision tasks. According to Bi et al. (2021), Transformers exhibit a lower inductive bias, resulting in better performance due to fewer assumptions about the optimal approach. To this end, we have selected the TransUNet (Chen et al. 2021) model, which combines Transformers and U-Net, as a promising alternative for identifying faults through image segmentation.

2 Background

The analysis of seismic data is crucial for the progress of hydrocarbon exploration, and it is commonly done through the study of geological structures. Various techniques have been employed to analyze seismic data, including machine learning and image processing, which aim to automate and facilitate the interpretation process, using seismic data in the form of an image. Pepper and Bejarano (2005), for example, presented case studies on automatic fault interpretation using only seismic attributes that highlight faults. These attributes work similarly to filtering techniques used in image processing, and two of them, dip and azimuth, showed the best results in identifying fault regions that were extracted as connected components. Zhao and Mukhopadhyay (2018) explored the task of fault detection in synthetic and field data by using convolutional neural networks (CNNs) to develop prediction models. In particular, Zhao and Mukhopadhyay (2018) improved the final result by adding image processing algorithms, such as smoothing and sharpening, after the prediction step.

As the need for more robust models for seismic interpretation has become evident, researchers have turned to deep machine learning models for performing these tasks. Wu et al. (2019b) developed several models with a primary focus on fault prediction, including FaultNet3D and FaultSeg3D (Wu et al. 2019a). Using a single CNN, the FaultNet3D model aimed to estimate the probability of faults, cracks, and dips. Meanwhile, the FaultSeg3D model focused on fault delineation, with its output being a binary mask representing the seismic data, where 1 denotes the presence of faults and 0 represents the absence of faults.

Research on fault identification remains crucial in the geological context, as seismic data acquisition has significantly increased and deep convolutional neural networks have been successfully applied. Among recent approaches, An et al. (2021) created a large database labeled by experts to supplement synthetic data and proposed a deep CNN based on edge detection, producing a pixel-by-pixel binary classification of faults with superior results compared to commonly used CNNs.

3 Materials

The primary material for this work is the seismic data, which will be introduced along with its acquisition. Figure 1 shows the four main parts of this phase: data sectioning, data split, data augmentation, and data normalization. Initially, two seismic cubes representing the input and target will be used to generate subsamples that will be augmented to provide the model with a diverse set of inputs.

Fig. 1.

Pre-processing workflow.

3.1 Seismic Data and Acquisition

The acquisition of seismic data using a marine approach is the foundation of this work, as it enables the extraction of information and characteristics from sedimentary basins, such as those in the Brazilian pre-salt. In order to conduct seismic surveys, a range of computational tools and systems are employed, with compatible real-time communication (Dondurur 2018). This process involves propagating elastic waves through the subsurface medium, which then reflect off interfaces and return to the surface, where they are detected by receivers.

Sound waves generated by equipment called airguns penetrate the marine subsoil and, upon reflection, are detected by receivers equipped with vibrating coils that produce electrical signals. These receivers, such as hydrophones, are often positioned near the water surface, attached to cables towed by seismic vessels (known as streamers). The electrical signals are transmitted via the cables to a seismograph recorder on the ship, resulting in representative images of subsurface structures. The data then undergo a careful processing step, in which they are grouped and the signal-to-noise ratio is enhanced to create images of subsurface structures.

The outcome of this acquisition process is a seismic volume, which is a three-dimensional function denoted by \(a(x, y, z)\). It reflects the changes in seismic amplitude (a) along the three coordinates: x, y, and z. The seismic amplitude (a) can be expressed as a function of depth (z), a function of \((x, z)\), or a function of \((x, y, z)\) (Alsadi 2017).

The dataset utilized in this study comprises two seismic volumes: the input and the target. The dimensions of both volumes are 1401\(\times \)1481\(\times \)241 pixels, and they cover an area of approximately 240 km\(^2\), encompassing two pre-salt Santos basin fields. The vertical limit of the volumes is around 2000m within the area of interest. The input volume represents the amplitude seismic values of the area, while the target volume contains the faults interpreted from the amplitude seismic. The target volume was generated from 94 interpreted faults, as depicted in Fig. 2, and used to create a binary model of fault (1) and no-fault (0) scenarios, based on the proximity to faults.

Fig. 2.

Target seismic definition.

3.2 Pre-processing

We start by converting the seismic cube and mask into NumPy arrays, which are structured in three-dimensional form. The data is then manipulated in 2D sections to generate 2D sub-images that can be used to train the model. For this purpose, we extract smaller patches with dimensions \(p \times p\) pixels from the seismic inlines. This method provides a large image dataset with samples that have a conventional square dimensionality suitable for convolutional models. To avoid repetition of information, the inline region is sectioned into subimages with a stride equal to \(p\), without overlapping, as shown in Fig. 3. This process generates \(\lfloor n/p \rfloor \) subimages from each of the treated inline images, where \(n\) is the horizontal extent of the inline.

Fig. 3.

Process of data sectioning.
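The sectioning described above can be sketched in a few lines of NumPy; the function and variable names are our own, not from the original pipeline:

```python
import numpy as np

def section_inline(inline_2d: np.ndarray, p: int) -> list:
    """Cut non-overlapping p x p patches from a 2D inline section.

    The horizontal axis is traversed with stride = p, so each inline
    yields floor(n / p) patches, where n is the inline width.
    """
    depth, width = inline_2d.shape
    return [inline_2d[:p, start:start + p]
            for start in range(0, (width // p) * p, p)]

# A 1481-pixel-wide inline with p = 160 yields floor(1481/160) = 9 patches.
patches = section_inline(np.zeros((160, 1481)), 160)
```

The remainder of the inline (1481 mod 160 = 41 columns here) is simply discarded, which matches the floor in the patch-count formula.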

Once the seismic data, input, and target have been prepared, the entire image dataset is randomized and divided into three separate sets: training, validation, and testing, partitioned in a ratio of 70%, 15%, and 15%, respectively. The correspondence between the original seismic and the fault mask is maintained throughout the entire process. Once the data were cropped and separated, we opted to apply data augmentation to the images that displayed fault presence. This was necessary because, in this type of task, the majority of the dataset contains images without faults, and at the pixel level the classes are even more unbalanced, with far more pixels labeled 0 than 1 (non-fault/fault) (Wei et al. 2022).
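A minimal sketch of the joint shuffle-and-split step, assuming image/mask pairs are held in parallel Python lists (the helper name and seed handling are illustrative):

```python
import numpy as np

def split_dataset(images, masks, seed=0):
    """Shuffle image/mask pairs jointly, then split 70/15/15 into
    training, validation, and test sets (correspondence preserved)."""
    idx = np.random.default_rng(seed).permutation(len(images))
    n_train = int(0.70 * len(images))
    n_val = int(0.15 * len(images))
    parts = (idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:])
    return [([images[i] for i in ids], [masks[i] for i in ids]) for ids in parts]
```

Shuffling a single index array and applying it to both lists is what keeps each seismic patch aligned with its fault mask.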

The augmentation technique applied to the database was the flip transformation. Although various geometric transformations can be used for augmentation, it is important to consider the data domain to ensure that the operations do not introduce errors into the learning process. In this case, flipping was chosen because it preserves the original orientation of faults, which is typically vertical. By reflecting each image across the vertical axis, the flip operation doubles the number of images, as illustrated in Fig. 4.

Fig. 4.

Vertical flipping transformation.
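The flip augmentation can be sketched with NumPy's `fliplr`, which reflects an array across its vertical axis; the helper name is ours:

```python
import numpy as np

def augment_with_flip(images, masks):
    """Append a horizontally flipped copy of every image and its mask,
    doubling the set while preserving the vertical fault orientation."""
    return (images + [np.fliplr(im) for im in images],
            masks + [np.fliplr(m) for m in masks])
```

Note that the mask must be flipped together with the image so that fault labels stay aligned with the seismic patch.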

Using the described procedure, we initially generated a database of 9369 sub-images from our seismic dataset, which had a dimension of 1401\(\times \)1481\(\times \)160 pixels (inline/xlines/crossline). This was achieved by generating \(\lfloor (n/p)\rfloor = 9\) images per inline, where \(p = 160\). The resulting images were split into 6558 for training, 1405 for validation, and 1406 for testing. To increase the training and validation sets, data augmentation was applied to all images containing faults, resulting in a 29% increase in the training data and a 36% increase in the validation data. As a result, the total number of images in the respective groups became 9242 and 1914.

Finally, all 2D image sections referring to seismic are normalized between -1 and 1, using the following equation:

$$\begin{aligned} \textit{img} = 2 \cdot \frac{\textit{img} - \min }{\max - \min } - 1 \end{aligned}$$
(1)

where img is the 2D image resulting from the normalization, \(\min \) is the minimum image value, and \(\max \) is the maximum image value.
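Equation (1) translates directly into NumPy; this sketch assumes a 2D floating-point section whose minimum and maximum values differ:

```python
import numpy as np

def normalize(img: np.ndarray) -> np.ndarray:
    """Min-max normalize a 2D section to the range [-1, 1], per Eq. (1)."""
    mn, mx = img.min(), img.max()
    return 2.0 * (img - mn) / (mx - mn) - 1.0
```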

4 Methods

The TransUNet model, which uses Transformers, is utilized in this work. Given the success of this architecture in the realm of visual computing, our aim is to evaluate its efficacy for fault extraction in heterogeneous seismic fields and compare it with traditional models such as convolutional neural networks. Figure 5 shows the training input and the models used in this process.

Fig. 5.

Methodology workflow.

4.1 CNN Models

Image classification, a fundamental problem in computer vision, involves categorizing images into predefined classes and serves as the basis for other tasks such as region localization, detection, and segmentation. CNNs are among the most commonly used deep learning networks for this task, named after the linear mathematical operation of convolution between matrices (Albawi et al. 2017). CNNs are a type of feedforward neural network, meaning that information flows in only one direction, from input to output. Inspired by biological neural networks (Rawat and Wang 2017), CNNs differ from regular neural networks in that each unit in a CNN layer is a two-dimensional filter convolved with the input of that layer, enabling them to extract local features from images (Khan et al. 2018).

The convolutional layer of a CNN consists of a two-dimensional filter that convolves with the input feature map. The filter is an array of discrete numbers, where each element is a weight, learned during the training phase. In the beginning, these weights are randomly assigned until the learning process updates them. The CNN architecture includes layers for convolution, pooling to reduce the feature map size, and fully connected layers where each neuron is directly connected to the neurons in the previous and next layers. To gain an understanding of how convolutional networks can be constructed, we will explore three models from the U-Net family and subsequently utilize them to draw comparisons with the Transformer architecture.

U-Net. The U-Net network was originally designed for the segmentation of medical images, and its architecture is an update and extension of the fully convolutional network (Long et al. 2015), aiming to improve segmentation accuracy while minimizing the required amount of data. The U-Net architecture, proposed by Ronneberger et al. (2015), consists of a contraction path (left side) and an expansive path (right side). The contraction path has a typical convolutional network architecture, comprising two 3\(\times \)3 convolutions, each followed by a rectified linear unit (ReLU), and a max-pooling operation with a 2\(\times \)2 filter and stride 2 for downsampling. The number of feature channels is doubled at each downsampling step. The expansive path, on the other hand, involves increasing the resolution of the feature map, followed by a 2\(\times \)2 up-convolution that halves the number of channels, concatenation with the corresponding feature map of the contraction path, and two 3\(\times \)3 convolutions (each followed by a ReLU). The last layer uses a 1\(\times \)1 convolution to map the output to the desired number of classes.
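The shape arithmetic of the contraction path can be illustrated with a small helper; for simplicity this assumes 'same'-padded convolutions, whereas the original U-Net uses unpadded ones that shrink each map slightly:

```python
def contraction_shapes(h, w, c0=64, steps=4):
    """Feature-map sizes (channels, height, width) along the contraction
    path: each step halves the spatial dimensions via 2x2 max-pooling
    with stride 2 and doubles the number of feature channels."""
    shapes = [(c0, h, w)]
    c = c0
    for _ in range(steps):
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((c, h, w))
    return shapes
```

For a 160\(\times \)160 input patch this yields maps from (64, 160, 160) down to (1024, 10, 10) at the bottleneck, which the expansive path then mirrors back up.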

U-Net++. Zhou et al. (2018) developed the U-Net++ architecture to improve the accuracy of medical image segmentation by addressing limitations in previous models. The architecture is based on dense and nested skip connections, which provide a new approach to the segmentation task. Zhou et al. (2019) proposed an approach that employs multiple U-Nets of different depths, whose encoders and decoders are connected by dense and nested skip connections. The U-Nets share an encoder, while their decoders are interconnected, and deep supervision is used to train all the constituent U-Nets while benefiting from a shared image representation. The redesigned skip connections in U-Net++ allow for variable-scale feature maps at a decoder node, enabling the aggregation layer to decide how feature maps carried over the skip connections should be merged with the decoder feature maps.

SegNet. SegNet (Badrinarayanan et al. 2017) is a convolutional neural network architecture designed for semantic pixel segmentation, comprising an encoder network, a corresponding decoder network, and a pixel-wise classification layer. The encoder network architecture is similar to that of VGG-16, with 13 convolutional layers. The decoder network also has 13 layers, each corresponding to an encoder layer. The final output of the decoder is fed into a multiclass softmax classifier to generate class probabilities for each pixel. Unlike U-Net, which transfers entire feature maps to the corresponding decoder and concatenates them with the upsampled decoder feature maps, SegNet reuses the max-pooling indices from the encoder to perform upsampling in the decoder.

4.2 Transformer Models

The Transformer (Vaswani et al. 2017; Khan et al. 2022) is a novel neural network that utilizes attention operations and was originally developed for natural language processing (NLP), where it has demonstrated remarkable success (Huang et al. 2020). In the field of computer vision, the Transformer has been increasingly employed to replace traditional techniques, resulting in various advantages (Bi et al. 2021).

The Transformer architecture includes an encoder and a decoder, both containing multiple attention blocks with the same architecture. The encoder produces encodings of the input, while the decoder takes these encodings and utilizes its contextual information to generate the output sequence (Han et al. 2022). Specifically, the Transformer encoder is composed of L layers of Multihead Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks, alternating between the two. Before each block, Layer Normalization (LN) is applied, and residual connections are used after every block. Finally, the encoded feature representation is upsampled to full resolution to predict the dense output.

The success of the transformer in NLP has encouraged researchers to explore its potential in other areas. Consequently, similar models have been developed to learn useful image representations using the Transformer’s concept. The Vision Transformer (ViT) (Dosovitskiy et al. 2020), for instance, has proven to be highly effective in several benchmarks, drawing inspiration from the self-attention mechanism in NLP, where word embeddings are substituted by patch embeddings (Fu 2022).

ViT has paved the way for the development of several other models based on attention mechanisms, which have brought about significant advances in various fields of computer vision. Surveys conducted by Guo et al. (2022) and Han et al. (2022) have shown that attention-based methods have been beneficial for tasks such as image classification, semantic segmentation, face recognition, few-shot learning, medical image processing, image resolution, 3D vision, among others.

TransUNet (Chen et al. 2021) is a model that harnesses the power of transformers for medical image segmentation. By combining CNN architectures, such as U-Net, which can extract low-level visual features to preserve fine spatial details, and Transformers, which excel in modeling global context, TransUNet creates a powerful hybrid architecture for accurate and efficient medical image segmentation.

TransUNet Architecture. TransUNet combines CNN and Transformer architectures to leverage the spatial details of CNN features and the global context captured by Transformers for medical image segmentation. The model follows a U-shape design, where Transformers establish self-attention mechanisms to encode the features in a sequence-by-sequence prediction perspective. The resulting self-attentive feature is upsampled and combined with high-resolution CNN features that were skipped during encoding, enabling precise localization.

The Transformer is used as an encoder by transforming the input image into a sequence of flattened 2D patches through tokenization. To achieve this, the input image x is reshaped into N patches of size \(P \times P\), where N is determined by the image's height and width (H and W) and the patch size (P), such that \(N=\frac{HW}{P^{2}}\). A position embedding is added to each patch to preserve its positional information in the sequence, and the resulting sequence is fed as input to the encoder.
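The tokenization step can be sketched in NumPy; `patchify` is an illustrative name, and the reshape assumes H and W are divisible by P:

```python
import numpy as np

def patchify(x: np.ndarray, P: int) -> np.ndarray:
    """Reshape an (H, W, C) image into N = H*W / P**2 flattened
    P x P patches, one token per row."""
    H, W, C = x.shape
    x = x.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, P * P * C)
```

For a 160\(\times \)160 single-channel patch with P = 16 this produces N = 100 tokens of 256 values each.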

To recover the spatial order during upsampling, the encoded feature size is reshaped from \(\frac{HW}{P^{2}}\) to \(\frac{H}{P} \times \frac{W}{P}\), while the number of channels is reduced to the number of classes using 1\(\times \)1 convolutions. Finally, the feature map is bilinearly upsampled to the full resolution of \(H \times W\) to generate the final segmentation output. To address the issue of partial information loss resulting from using Transformer solely as an encoder, TransUNet utilizes a hybrid CNN-Transformer architecture that first leverages CNN to extract features from the input, followed by patch embedding of 1\(\times \)1 patches extracted from the CNN feature map instead of the raw images. Therefore, the sequence of hidden features is reshaped to achieve full resolution from \(\frac{H}{P} \times \frac{W}{P}\) to \(H \times W\) by applying multiple cascades of upsampling blocks. Each block includes a 2\(\times \) upsampling operator, a 3\(\times \)3 convolution layer, and a ReLU layer in sequence. This enables the aggregation of features at different resolution levels through skip connections.
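Recovering the spatial grid from the encoded sequence is essentially a reshape; the sketch below assumes the sequence is stored as an (N, D) array with D hidden channels (names illustrative):

```python
import numpy as np

def seq_to_grid(seq: np.ndarray, H: int, W: int, P: int) -> np.ndarray:
    """Reshape an encoded (N, D) sequence back into a (D, H/P, W/P)
    feature grid, recovering the spatial order before upsampling."""
    N, D = seq.shape
    assert N == (H // P) * (W // P)
    return seq.T.reshape(D, H // P, W // P)
```

Each subsequent upsampling block then doubles the spatial resolution of this grid until \(H \times W\) is reached.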

4.3 Evaluation Metrics

Since the problem can be framed as a semantic segmentation task, it is crucial to rigorously evaluate the system's efficiency for it to be useful and produce effective contributions. To evaluate the effectiveness of a segmentation system used for extracting faults in images with machine learning methods, two essential metrics that assess the quality of segmentation were selected:

  • Jaccard index: also known as the intersection over union (IoU), it quantifies the percentage overlap between the mask input (i.e., the target of the segmentation) and the predicted output. The Jaccard index (Shi et al. 2014) is calculated as the number of pixels that are common between the two images \((A\cap B)\) divided by the number of pixels resulting from the union of both \((A\cup B)\).

  • Dice coefficient: is a measure frequently utilized in the computer vision field to evaluate the similarity between two images (Crum et al. 2006). It is similar to the IoU and computed as twice the overlap area divided by the total number of pixels in both images.
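Both metrics are straightforward to compute on binary masks; a minimal NumPy sketch (assuming the masks are not both empty, so the denominators are non-zero):

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Jaccard index: intersection over union of two binary masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice coefficient: twice the overlap over the total pixel count."""
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())
```

Dice is always at least as large as IoU for the same prediction, which is consistent with the gap between the two scores reported in Sect. 5.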

The performance assessment of the employed models can also be examined through a pixel-wise classification approach. This method employs metrics, including accuracy, recall, precision, and F1-score, to evaluate the prediction performance on positive samples. The confusion matrix presents the predictions made for each class and provides a representation from which the effectiveness of the classification model can be measured. In fault detection applications, positive samples represent faults and their locations in a seismic volume (Huang et al. 2017).
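Given the confusion-matrix counts, the pixel-wise metrics follow directly; a small sketch with illustrative names, treating fault pixels as the positive class:

```python
def pixel_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix
    counts (true/false positives and negatives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall, f1
```

Because fault pixels are rare, accuracy alone can look high even for a poor model, which is why precision, recall, and F1 are reported alongside it.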

5 Results and Discussions

This section presents and discusses the results obtained from the trained models, the reconstruction of the seismic areas, and the performance metrics, using the previously described methodology.

5.1 Experiments

To assess the effectiveness of applying Transformer on seismic data, we utilized the TransUNet model on the database outlined in Sect. 3. Moreover, we employed three additional models that employ the convolutional neural network approach to compare the attained results.

The architectures were parameterized with identical settings for the loss function, learning rate, and number of epochs to enable comparison of outcomes under the same initialization. An empirical value of \(10^{-4}\) was chosen for the learning rate based on previous evaluations of different values. Binary cross-entropy was chosen as the loss function, as it is commonly used for classification purposes and semantic segmentation is a pixel-level classification task. The number of epochs was set to 100 for all models; for TransUNet, the batch size was reduced to 16 due to its higher memory requirements compared to the other networks.

The results obtained from the execution of all methods are presented in Table 1, where it can be observed that the TransUNet network surpasses the other architectures (U-Net, U-Net++, and SegNet) by 2.35%, 4.93%, and 5.03%, respectively, achieving an overall Dice score of 88.34%. Considering the IoU metric, this difference increases to 4.86%, 6.73%, and 7.23%, with TransUNet obtaining a value of 84.34%. It is worth noting that the difference between the values obtained by these two metrics is mainly due to the high penalty imposed by the IoU in cases where the classification results are poor. The evaluation of the other metrics confirms the superior performance of the TransUNet network.

Table 1. Quantitative comparison of the segmentation performance in fault detection task.

The comparisons between the predictions made by the previously presented models are illustrated in Fig. 6. The results indicate that U-Net and TransUNet produced images that are more similar to the target, while SegNet and U-Net++ exhibit a considerable amount of noise in their outputs. When comparing U-Net and TransUNet, a slightly more accurate border delimitation can be observed in U-Net. This may be due to the greater abstraction of global context extraction in the initial layers of TransUNet.

Fig. 6.

Qualitative comparison of different models applied to seismic segmentation.

In order to conduct a more comprehensive analysis of the performance of the two networks, we reconstructed the slices formed by each predicted sub-image. This reconstruction was possible because each input image to the network is named according to its seismic inline and cut order. Consequently, the training images were predicted in sequential order and their results were concatenated, as illustrated in the example shown in Fig. 7.

Fig. 7.

Image prediction and concatenation process.
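The concatenation step amounts to joining the predicted patches of one inline, taken in cut order, along the horizontal axis; a minimal sketch (helper name ours):

```python
import numpy as np

def reconstruct_inline(patches):
    """Rebuild one inline slice by concatenating its predicted
    sub-images horizontally in their original cut order."""
    return np.concatenate(patches, axis=1)
```

Stacking the reconstructed inlines then yields the full predicted volume examined in Figs. 8 and 9.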

By applying this approach, we can reconstruct the entire seismic volume and examine the output of the two models. Figure 8 presents a comparison of the predictions made for two different slices, indicating that both models were able to detect the structures highlighted in the target, as well as some smaller regions, with TransUNet providing a more significant representation of them.

Fig. 8.

Comparison between U-Net and TransUNet predictions.

Upon analyzing the overall result by visualizing the seismic cube prediction, as presented in Fig. 9, it is evident that both models were able to identify most of the structures present in the target. However, some additional small regions were also detected, which require detailed analysis when viewed in two dimensions, as depicted in Fig. 10.

The TransUNet prediction achieved slightly better results, with more indicated faults and greater vertical continuity of faults outside the regions present in the target binary cube. Nevertheless, the overall difference between the two methods was minimal.

Fig. 9.

Comparison between U-Net and TransUNet prediction in full 3D field.

Fig. 10.

Analysis of predicted faults in two different regions, A and B.

6 Conclusions

This work investigated the use of a hybrid model, TransUNet, which combines the strengths of convolutional networks and Transformer’s content abstraction in the geological context. The results demonstrate the effectiveness of this approach in segmenting seismic images from a heterogeneous environment, such as the pre-salt layer, indicating potential applications of this architecture in various configurations for identifying and extracting geological structures in the field of seismic imaging.

The effectiveness of TransUNet in fault identification on seismic data was demonstrated by comparing it to conventional state-of-the-art methods. This study concludes that incorporating the Transformer in this context has the potential to extract valuable information from seismic databases. The evaluation included both qualitative and quantitative approaches, suggesting that this architecture could serve as a reference for other databases considering the use of TransUNet in the field of visual computing.