1 Introduction

Self-Supervised Learning (SSL) has attracted the community’s attention as a powerful tool to learn valuable representations from large amounts of unlabeled data [2, 18]. Essentially, SSL approaches conceive a pretext task in which an encoder, coupled with a projection head, is encouraged to learn efficient data representations. The pretext labels are derived directly from the (originally unlabeled) input samples themselves. After training, the encoder is detached from the projection head, and its knowledge is transferred to the downstream task, where it is coupled with a prediction head to solve the desired task. This is especially useful for learning data representations that can be exploited in multiple tasks, while also saving annotation effort, which is time-consuming and expensive [4].

Applying SSL to human activity recognition (HAR) data is a growing trend [13], due to the limited size of labeled datasets, the high data requirements of models, and the possibility of collecting data on a large scale [14]. This scenario motivated this study to evaluate Temporal Neighborhood Coding (TNC) [10], a technique focused on time series data, for the HAR task of predicting which activity is being performed based on inertial sensor data. SSL for time series can be divided into three main categories [6, 16]: generative-based, contrastive-based, and adversarial-based methods.

Contrastive-based methods learn representations by comparing positive and negative pairs of samples. According to Zhang et al. [16], contrastive methods can be classified into five subcategories: prediction contrast, augmentation contrast, prototype contrast, expert knowledge contrast, and sampling contrast. In prediction contrast, the pretext task is to predict the representations of future segments/windows from previous excerpts of the time series, whereas in augmentation contrast, the representations of augmented versions of a time window are forced to be similar to each other but different from those of distant time windows. Prototype contrast clusters the learned representations. Expert knowledge contrast incorporates prior expert knowledge to choose appropriate positive or negative samples during training. Sampling contrast compares time series windows and their generated augmentations to assess similarity.

TNC [10] is an SSL technique based on the contrastive learning paradigm. It aims to learn representations by comparing pairs of similar and dissimilar samples. Unlike traditional contrastive methods that use augmentations to generate positive and negative pairs, TNC defines these pairs based on the concept of temporal neighborhoods. The original version uses the Augmented Dickey-Fuller (ADF) statistical test [7] to identify the stationarity of a signal and thereby define positive and negative pairs. The technique was originally tested with a bidirectional single-layer Recurrent Neural Network (RNN) encoder.

After the original work of Tonekaboni et al. [10], modified versions of TNC have been proposed, such as TNC-sim [11], which replaced the ADF statistical test with the cosine similarity metric. Additionally, Retrieval-Based Reconstruction (REBAR) [12] uses TNC with a different encoder, exploring the dilated convolutions proposed in TS2Vec [15] instead of an RNN. Available research works implement some of these alternatives in specific contexts, but do not establish a detailed comparison between them on the same dataset.

In this study, we assess the performance of various TNC variations. The main contributions of this work are:

  1.

    We evaluated the different TNC variations under identical conditions, using the same code base and the UCI dataset with raw data, and showed that the TS2Vec encoder significantly outperforms the RNN encoder for this task, achieving accuracies that are 15 to 17 percentage points higher. We also show that replacing the ADF statistical test with cosine similarity has little impact on model accuracy, while reducing training time by a factor of \(7 \times \) to \(9 \times \).

  2.

    We also evaluated the impact of applying a TNC encoder to both raw data and data represented by handcrafted features (produced and provided by the authors of the UCI dataset [8]). Our findings indicate that learning from handcrafted features is easier. Still, the more advanced versions of TNC can effectively learn robust features from the raw data, achieving performance comparable to that of models trained on handcrafted features.

The rest of this paper is organized as follows: Sect. 2 presents the related works. Section 3 provides an overview of the TNC technique and its variants. Section 4 describes the materials and methods and discusses the experimental results. Finally, Sect. 5 presents the main conclusions and possible future work.

2 Related Works

This section describes the related works, first defining the datasets used in this paper, followed by a comparison of works that evaluated TNC performance under different circumstances, such as datasets, encoders, and similarity functions.

Before discussing the TNC-related works, it is important to distinguish two variants of the UCI dataset for HAR, a dataset that is frequently used to evaluate Machine Learning (ML) models on HAR tasks. The UCI dataset, provided by the University of California, Irvine (UCI) [8], contains data from smartphone accelerometers and gyroscopes, annotated with human activity labels (e.g., walking, laying). The first version is a preprocessed dataset in which the authors extracted 561 handcrafted features from the raw accelerometer and gyroscope data. We refer to this version of the UCI dataset as UCI-FE (Feature Engineered). Later, the authors released an updated version containing the raw signals from the smartphone’s accelerometer and gyroscope. We refer to this version as UCI-Raw.

The Temporal Neighborhood Coding (TNC) technique was first introduced by Tonekaboni et al. [10] in 2021. Their work employed a bidirectional single-layer RNN as the encoder and the ADF test to determine the interval around a time window in which the signal can be considered approximately stationary. They evaluated the technique on the UCI-FE dataset and reported an accuracy of 88.3%.

In 2023, Wang et al. [11] explored a variation of TNC that uses the cosine similarity between time windows, instead of the ADF test, to establish the range of a temporal neighborhood, selecting the most similar windows as neighbors. They used the same RNN encoder as Tonekaboni et al. [10] and compared the proposed method with the ADF-based one on a sleep stage classification task. Their results suggest that TNC-sim performs 2.81% better than TNC-adf in the time domain with three classes and 2.51% better with five classes.

Also in 2023, Xu et al. [12] evaluated a variation of the TNC technique employing a dilated convolution encoder on the UCI-Raw dataset and verified large performance gains compared to the default RNN, reaching an accuracy of 94.3% on UCI-Raw. This is significant because the dataset in this case does not contain handcrafted features, so the encoder is learning to extract knowledge better than feature engineering. The chosen encoder structure was the one employed by Yue et al. [15], who introduced the TS2Vec technique. Yue et al. [15] report significant performance improvements for TNC using their dilated convolution encoder, attributing this to an architecture that adapts to different dataset scales thanks to the receptive fields of dilated convolutions.

Other works have also compared the performance of RNNs against dilated convolutions. Franceschi et al. [3] use deep neural networks with exponentially dilated causal convolutions, which capture long-range dependencies better than full convolutions, together with a triplet loss employing time-based negative sampling on variable-length multivariate time series, presenting their encoder as more efficient and scalable than an RNN. Bai et al. [1] also support this assumption, presenting results in which dilated convolutions outperformed RNNs in terms of efficiency and predictive performance.

The aforementioned works evaluated TNC with different encoders (RNN and TS2Vec) and test functions (ADF and cosine similarity). However, they evaluate TNC under different, specific circumstances (ML task, dataset, etc.). Moreover, although some comparisons between TS2Vec and RNN are provided, these works do not carry out experiments focused on analyzing the impact of these alternatives in a clear and standardized manner. Hence, in this work, we evaluate all these combinations with the same code base and in the same setup, i.e., human activity recognition using the UCI-Raw dataset.

Table 1 summarizes the TNC-based works presented in this section according to the dataset, encoder and test functions evaluated in each work.

Table 1. TNC related works

3 TNC Technique and Its Variants

The TNC technique is based on the principle that neighboring windows in a time series are likely to belong to the same class and, therefore, should have similar representations. Conversely, distant (non-neighboring) windows are more likely to belong to different classes and should have distinct representations. To achieve this, the TNC technique trains an encoder (Enc) that encodes the time series windows and a discriminator (D) that determines whether a pair of windows are neighbors (i.e., close to each other).

Figure 1 illustrates the TNC technique. First, given a randomly selected query window (\(W_q\)), a window selector \(W_s\) selects two extra windows: a close one (\(W_c\)) and a distant one (\(W_d\)). Then, \(W_q\), \(W_c\), and \(W_d\) are encoded by Enc into \(Z_q\), \(Z_c\), and \(Z_d\), respectively. Finally, the discriminator D determines whether the pairs (\(Z_q\), \(Z_c\)) and (\(Z_q\), \(Z_d\)) are neighbors (close) or not.

Fig. 1.

Overview of the TNC technique. The window selector \(W_s\) selects a neighbor window \(W_c\) and a non-neighbor window \(W_d\) for each query window \(W_q\). The encoder Enc learns the data representation and feeds the samples \(Z_c\) and \(Z_d\) into a discriminator D that predicts the probability of each window being a neighbor of \(W_q\).

The window selector \(W_s\) tests whether the selected windows (i.e., \(W_c\) and \(W_d\)) are indeed similar to or different from \(W_q\). For \(W_c\), the test can be either the ADF statistical test, which determines the Gaussian distribution parameters and enables the selection of windows that follow that distribution, or the selection of the window with the highest cosine similarity to \(W_q\). For \(W_d\), windows are sampled randomly, ensuring a sufficient distance from \(W_q\) in the time series to be considered non-neighbors. The work of Wang et al. [11] presented cosine similarity as a possibility for constructing non-neighborhoods as well, but in our case it was only considered when forming \(W_c\).
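The cosine-similarity variant of the window selector can be sketched in a few lines of numpy. The candidate-window construction, search span, and function names below are our own simplifying assumptions (a univariate series and uniformly sampled candidates around the query), not the implementation of Wang et al. [11].

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened windows."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_neighbor(series, q_start, delta, span=5, n_candidates=10, seed=0):
    """Pick W_c: among candidate windows near W_q, return the most similar one."""
    w_q = series[q_start:q_start + delta]
    rng = np.random.default_rng(seed)
    lo = max(0, q_start - span * delta)
    hi = min(len(series) - delta, q_start + span * delta)
    starts = rng.integers(lo, hi, size=n_candidates)
    sims = [cosine_similarity(w_q, series[s:s + delta]) for s in starts]
    best = starts[int(np.argmax(sims))]
    return series[best:best + delta]

t = np.linspace(0, 20, 2000)
series = np.sin(2 * np.pi * t)          # toy periodic signal
w_c = select_neighbor(series, q_start=500, delta=128)
print(w_c.shape)  # → (128,)
```

Compared to running an ADF test per candidate, this selection only requires dot products, which is consistent with the training-time savings reported later in this paper.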

The encoder (Enc) can be either a bidirectional RNN with one layer and 100 hidden units, as presented by Tonekaboni et al. [10], or the TS2Vec dilated convolution encoder with ten residual blocks, each containing two one-dimensional convolutional layers with a dilation parameter, as in the works of Xu et al. [12] and Yue et al. [15]. The discriminator D is a binary classifier that receives the encoded versions of a pair of windows (e.g., \(Z_q\) and \(Z_c\)) and predicts whether they are close or not. It is composed of a fully connected layer followed by a ReLU activation, a dropout layer, and another fully connected layer; the TS2Vec discriminator includes an additional max pooling operation. Losses are calculated for close (pair \(W_q\) and \(W_c\)) and distant pairs (pair \(W_q\) and \(W_d\)), including a weighting term W inspired by Positive-Unlabeled (PU) learning: each neighbor sample is treated as a positive example, while each non-neighbor sample is treated as a mixture of a positive example with weight W and a negative example with the complementary weight. In short, W represents the probability of an unlabeled sample being a positive (neighbor) sample and can be determined by prior knowledge or learned via a hyperparameter search. The combined loss is minimized to train both the discriminator and the encoder, ensuring the model differentiates between temporally close and distant samples. After training the encoder in a self-supervised manner, its weights are stored and can be used for the downstream task.
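The PU-weighted objective described above can be sketched as follows, given the discriminator outputs for close and distant pairs. This is a numpy illustration of the loss structure under our reading of the technique (function and variable names are ours), not the authors' training code.

```python
import numpy as np

def tnc_loss(p_close: np.ndarray, p_distant: np.ndarray, w: float = 0.05) -> float:
    """PU-weighted TNC objective.

    p_close:   D(Z_q, Z_c) for neighboring pairs, treated as positives.
    p_distant: D(Z_q, Z_d) for distant pairs, treated as a mixture:
               positive with weight w, negative with weight (1 - w).
    """
    eps = 1e-12
    loss_close = -np.mean(np.log(p_close + eps))
    loss_distant = -np.mean(w * np.log(p_distant + eps)
                            + (1 - w) * np.log(1 - p_distant + eps))
    return float(loss_close + loss_distant)

# A discriminator that is confident neighbors are close and distant pairs are
# not yields a lower loss than an uninformative one:
good = tnc_loss(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
bad = tnc_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(good < bad)  # → True
```

Minimizing this quantity jointly updates the encoder and the discriminator, which is the mechanism that pushes temporally close windows toward similar representations.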

In this work, we evaluate different TNC variants, i.e., using the RNN and TS2Vec encoders and the ADF and cosine similarity test functions in the window selector. To distinguish between the four variants, we employ the following acronyms: TNC-RNN-adf, TNC-RNN-sim, TNC-TS2Vec-adf, and TNC-TS2Vec-sim. This is the first work that evaluates all these variations of TNC using the same dataset and code base, allowing a fair comparison between the different configurations. We also experimented with two values of W, which provides an exploration of different TNC configurations with respect to the encoder, the window selector method, and the weighting term.

4 Experimental Results and Discussion

This section presents our methodology and the experimental results. Initially, Sect. 4.1 describes the materials and methods used to evaluate the TNC technique. Then, Sect. 4.2 presents quantitative analysis, discussing the performance of each variant on the UCI-Raw dataset. Finally, Sect. 4.3 provides a qualitative analysis of the learned representations.

4.1 Materials and Methods

The analysis is divided into two parts. First, we replicate two of the previous works: the work of Tonekaboni et al. [10], following the code made available on GitHub (Footnote 1), in which we evaluate the TNC-RNN-adf variant using the UCI-FE dataset; and the work of Xu et al. [12], in which we evaluate the TNC-TS2Vec-sim variant using the UCI-Raw dataset, following the exact architecture proposed by the authors. This allows us to validate our code base, ensuring we execute the variants with models and parameters compatible with the related works. We also want to assess the performance impact the two variants of the UCI dataset (i.e., UCI-FE and UCI-Raw) have on TNC-RNN-adf, which has not yet been reported in the literature.

After replicating the original results, we proceeded to the performance analysis of the four TNC variants on the UCI-Raw dataset. We decided to focus on UCI-Raw because most publicly available HAR datasets are distributed in this format and the majority of previous work on representation learning focuses on learning representations from raw data. The code used in this analysis was based on the code made available by Xu et al. [12] on GitHub (Footnote 2). Notice that both versions come from the same UCI data collection, but Xu et al. [12] (UCI-Raw) and Tonekaboni et al. [10] (UCI-FE) use different pre-processing steps and dataset partitions, which might introduce bias into the results.

The original UCI dataset contains six classes: Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, and Laying. It was later extended to include postural transitions (HAPT [9]); in this work we employ only the original six classes. The UCI-Raw dataset consists of records with 6 channels (three accelerometer axes and three gyroscope axes) sampled at 50 Hz, concatenated per user and segmented into non-overlapping windows of 2.56 s, corresponding to 128 time points, following the data processing steps of [12].
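The segmentation step can be sketched as follows; the function below is a minimal numpy illustration of the non-overlapping windowing described above, not the exact pre-processing script of [12].

```python
import numpy as np

def segment(signal: np.ndarray, window: int = 128) -> np.ndarray:
    """Split a (T, C) recording into non-overlapping (window, C) segments.

    128 samples at 50 Hz correspond to the 2.56 s windows used in UCI-Raw;
    any incomplete tail shorter than one window is dropped.
    """
    n = signal.shape[0] // window
    return signal[:n * window].reshape(n, window, signal.shape[1])

# 6 channels: 3-axis accelerometer + 3-axis gyroscope (toy recording of zeros)
recording = np.zeros((1000, 6))
windows = segment(recording)
print(windows.shape)  # → (7, 128, 6)
```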

The selection of the window size \(\delta \) depends on prior knowledge of the signals; it should be chosen so that a window contains just enough information about the current state of the signal. The 4,600 windows of the dataset were split into 70% for training, 15% for validation, and 15% for testing, using the same partitions in all experiments. We ensured that all windows from the same user always fall into the same subset. The weight parameter W was chosen based on previous work, being set to 0.05 [10] and 0.2 [12]. The encoding size was kept constant at 320, with a batch size of 16, a learning rate of 0.00001, and 100 training epochs.
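Keeping all of a user's windows in the same subset amounts to splitting at the user level rather than the window level. A minimal sketch of such a group-aware split is shown below; the helper name and the strategy of partitioning whole users (so the 70/15/15 window proportions hold only approximately) are our own assumptions about how this can be done, not our exact partitioning code.

```python
import numpy as np

def split_by_user(user_ids: np.ndarray, fracs=(0.7, 0.15, 0.15), seed=0):
    """Assign whole users to train/val/test so no user's windows are split."""
    users = np.unique(user_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(users)
    c1 = int(fracs[0] * len(users))
    c2 = int((fracs[0] + fracs[1]) * len(users))
    parts = {"train": users[:c1], "val": users[c1:c2], "test": users[c2:]}
    # Map each subset's users back to window indices
    return {k: np.flatnonzero(np.isin(user_ids, v)) for k, v in parts.items()}

user_ids = np.repeat(np.arange(10), 20)   # 10 hypothetical users, 20 windows each
splits = split_by_user(user_ids)
# No user leaks between train and test:
overlap = set(user_ids[splits["train"]]) & set(user_ids[splits["test"]])
print(len(overlap))  # → 0
```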

To evaluate the discriminative performance of each method on the downstream task, we used the linear readout protocol, applying logistic regression on top of the frozen representations learned by the encoder. This assesses the capability of the features learned by the encoder to distinguish between the classes, measured by accuracy, area under the precision-recall curve (AUPRC), balanced accuracy, and \(F_1\)-score. Training time was measured on a machine with an Intel(R) Core(TM) i5-8500 CPU @ 3.00 GHz, 16 GB of RAM, and an NVIDIA GTX 1080 GPU with 8 GB of memory.
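The linear readout protocol can be sketched with scikit-learn as below. The representations here are random placeholders standing in for frozen encoder outputs (the 320-dimensional encoding size and six classes match our setup, but the data is synthetic), so the reported scores sit at chance level.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical frozen encoder outputs: in the real pipeline these come from
# the trained TNC encoder; here they are random stand-ins.
rng = np.random.default_rng(0)
z_train, y_train = rng.standard_normal((200, 320)), rng.integers(0, 6, 200)
z_test, y_test = rng.standard_normal((50, 320)), rng.integers(0, 6, 50)

# Linear readout: the encoder stays frozen, only this classifier is fitted.
clf = LogisticRegression(max_iter=1000).fit(z_train, y_train)
y_pred = clf.predict(z_test)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="macro")
print(f"accuracy={acc:.3f}  macro-F1={f1:.3f}")
```

Because the classifier is linear, high readout scores can only come from the encoder already having made the classes linearly separable.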

We also compared the TNC variations qualitatively by visualizing the generated representations in the latent space on a 2D chart with the aid of t-SNE [5]. The goal was to verify whether the dataset samples are clustered according to their classes.
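Projecting the latent space to 2D for this kind of inspection can be done with scikit-learn's t-SNE, as sketched below on placeholder representations (random stand-ins for encoder outputs; the perplexity value is an illustrative choice).

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
representations = rng.standard_normal((100, 320))  # hypothetical encoder outputs

# Reduce the 320-dimensional latent vectors to 2D for plotting
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(representations)
print(embedding.shape)  # → (100, 2)
```

The resulting 2D points can then be scatter-plotted and colored by activity label to check whether class-wise clusters emerge.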

4.2 Quantitative Analysis

Table 2 presents the results reported by Tonekaboni et al. [10] and Xu et al. [12], along with the results produced with our code (labeled “repro”) using the same parameters they used. Notice that the accuracy and AUPRC results are very similar to those reported by Tonekaboni et al. and Xu et al.

Table 2. Results in percentage for replication of original works

Table 3 shows the results for the four TNC variants with the parameter W equal to 0.05 and 0.20. The first four rows contain the variants with the RNN encoder, while the last four contain the ones with the TS2Vec encoder. The first observation is that the TS2Vec encoder significantly outperforms the RNN encoder for this task, achieving accuracies that are 15 to 17 percentage points higher.

Table 3. Results in percentage with mean and standard deviation for different implementations of TNC on UCI Dataset Raw.

The TNC-TS2Vec-sim variant worked best with W = 0.20 while the other alternatives worked best with W = 0.05. However, the impact on performance is very small (\(\le \)1% on accuracy).

Regarding the use of a statistical test versus cosine similarity, the latter showed little impact on the metrics, indicating that the neighborhood selection method did not produce major differences in the results for HAR data. However, one positive impact of using cosine similarity is on training time (last column), which was \(7 \times \) to \(9 \times \) shorter for both encoders and weight parameters, with an even larger speedup when using the RNN encoder. Training time is relevant, especially for works that need fast implementations to evaluate different SSL strategies, as TNC was presented as a very time-consuming and less competitive method in works such as [12, 17], which reported TNC with the RNN encoder to be 250 times slower than the TS2Vec framework for time series representation learning.

By analyzing the second row of Table 2 (TNC-RNN-adf/\(W=0.05\) on UCI-FE) and the first row of Table 3 (TNC-RNN-adf/\(W=0.05\) on UCI-Raw), it is possible to see that it was easier for TNC-RNN-adf to learn from the UCI-FE dataset (accuracy of 88.0%) than from UCI-Raw (accuracy of 78.7%). This might be attributed to the handcrafted features. Nonetheless, TNC-TS2Vec-sim and TNC-TS2Vec-adf were capable of learning good features from the UCI-Raw dataset, achieving 95% accuracy. This result suggests that the TNC technique combined with the TS2Vec encoder can automatically learn highly discriminative features, reaching the best performance for UCI-Raw in our experiments.

To validate our findings, we performed a statistical analysis after running each row of Table 3 eight times, forming a sample of 64 runs, 32 for each encoder (RNN and TS2Vec). We first assessed the normality of the data for each encoder using the Shapiro-Wilk test, which presented a p-value below 0.05 for the RNN, indicating a significant deviation from a normal distribution. Given that, we proceeded with the non-parametric paired Wilcoxon signed-rank test. The test resulted in a p-value below 0.05, which indicates that the difference between the two encoders is statistically significant.
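The testing procedure above can be sketched with `scipy.stats`. The accuracy samples below are synthetic stand-ins (drawn with means loosely matching the reported accuracies), not our experimental measurements; the logic of checking normality before choosing the paired test is what the sketch illustrates.

```python
import numpy as np
from scipy.stats import shapiro, wilcoxon, ttest_rel

# Hypothetical paired accuracies, 32 runs per encoder as in our protocol
rng = np.random.default_rng(0)
acc_rnn = rng.normal(0.78, 0.02, 32)
acc_ts2vec = rng.normal(0.95, 0.01, 32)

# Shapiro-Wilk normality check decides between parametric and non-parametric
if shapiro(acc_rnn)[1] < 0.05 or shapiro(acc_ts2vec)[1] < 0.05:
    stat, p = wilcoxon(acc_rnn, acc_ts2vec)   # non-parametric paired test
else:
    stat, p = ttest_rel(acc_rnn, acc_ts2vec)  # paired t-test

print(p < 0.05)  # → True: the difference is significant for these samples
```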

4.3 Qualitative Analysis

This section provides a qualitative analysis of the learned representations.

First, we reproduce the t-SNE plots presented by Tonekaboni et al. [10]. Figure 2 shows the t-SNE plot for the test set of the UCI-FE dataset (left) and for its samples encoded by the TNC-RNN-adf variant (right). Notice that the handcrafted features already split the data into three major clusters: (Laying), (Walking Downstairs; Walking Upstairs), and (Sitting; Standing; Walking). However, there is still some confusion among classes inside these clusters. The TNC-RNN-adf encoder, on the other hand, was capable of improving the representation by separating the classes into six more separable clusters, in line with the results reported by Tonekaboni et al. [10] in Appendix A.7.2.

Fig. 2.

t-SNE of TNC-RNN-adf Encoder on Data with Features

Now, we turn our attention to the representations learned by the four variants of TNC on the UCI-Raw dataset. We start by analyzing the t-SNE chart for the raw test set, shown in Fig. 3. Notice that, despite the existence of two clusters, there is no clear separation between most classes. This helps explain why it is easier to learn from the UCI-FE dataset than from the UCI-Raw one.

Fig. 3.

t-SNE chart for the UCI-Raw dataset

Figure 4 shows the t-SNE charts for the representations learned by the four variants of TNC. Since the W parameter had little impact on the result, we decided to report only the best result for each one of the variants in terms of accuracy. Notice that the variants with the TS2Vec encoder (bottom) perform better than the ones with the RNN encoder (top), managing to provide a clear separation between most classes. Also, the choice of using ADF or Cosine Similarity to select the neighborhood had little impact on the clustering.

Fig. 4.

Comparison between t-SNE on different TNC variations

5 Conclusions and Future Work

In this work, we evaluated different variants of TNC using the feature-engineered (UCI-FE) and raw (UCI-Raw) versions of the UCI HAR dataset. UCI-FE was the version used by the original TNC work [10], while UCI-Raw is more often used to evaluate machine learning techniques. Regarding TNC itself, we evaluated the impact that two different encoders (RNN and TS2Vec) and two different neighborhood selection functions (ADF and cosine similarity) have on the performance of the model when trained on the UCI-Raw dataset.

Our evaluation of different TNC variants using the UCI-Raw dataset demonstrated that the TS2Vec encoder significantly outperforms the RNN encoder for this task, achieving accuracies that are 15 to 17 percentage points higher. Additionally, we found that replacing the ADF statistical test with cosine similarity has minimal impact on model accuracy while reducing training time by a factor of \(7 \times \) to \(9 \times \).

Furthermore, our assessment of the impact of applying a TNC encoder to both raw data (UCI-Raw) and data represented by handcrafted features (UCI-FE) indicates that, while learning from handcrafted features is easier, the more advanced versions of TNC can effectively learn robust features from raw data, achieving performance comparable to that of models trained on handcrafted features, i.e., 95% accuracy.

As future work, an ablation study of different hyperparameters, especially the weight parameter and the window size, can be added to the discussion, as well as an evaluation of fully supervised training using the encoders with a linear classification layer. An analysis of the model’s performance with different training sets on the downstream task can also be included. Other HAR datasets or encoders can be used in the same analysis to assess their impact on the TNC variations, and other methods can be tested with different encoders on the same data to enrich the comparison. Finally, other tasks can be used to evaluate TNC beyond a specific downstream task such as HAR.

Reproducibility Statement. Section 4 describes the materials and methods used in our evaluations. The code is publicly released on GitHub (Footnote 3). The datasets used are publicly available, and a script containing the data processing steps is also provided.