Abstract
Self-Supervised Learning (SSL) techniques have been successfully employed to learn useful representations for various data modalities without labels. These techniques use a pretext task to train the backbone of a deep-learning model without labels and then leverage the pre-trained backbone to train a downstream model with a few labeled samples. In this context, Contrastive Predictive Coding (CPC) is an SSL technique that has demonstrated promising results in several tasks, including human activity recognition (HAR). In this work, we explore the impact of data variety on backbone pre-training when designing CPC models for HAR and the benefits of pre-training on the final model. We evaluated the impact of data variety on model pre-training using fifteen combinations of four distinct HAR datasets, finding significant performance variability based on the pre-training datasets, with the \(F_1\)-score varying by 9.6 to 13 percentage points across different target datasets. We also found that including the target dataset in the pre-training process generally improved performance and that pre-training with all four datasets produced a high-quality backbone, yielding downstream models performing near the best on all target datasets. These findings emphasize the importance of selecting pre-training datasets aligned with the downstream task domain. Additionally, we demonstrated that CPC pre-training significantly benefits downstream model performance with limited data, achieving comparable \(F_1\)-scores with just 5% of the data as with 100%, indicating that CPC effectively captures essential features of the problem domain.
1 Introduction
Self-Supervised Learning (SSL) has emerged as a powerful approach for extracting meaningful representations from unlabeled data. It leverages pretext tasks to learn useful features without the need for manual annotations. This paradigm not only expands the utility of available data but also allows for the creation of models that can be efficiently tuned to specific tasks with minimal effort [1, 4].
Within SSL, Contrastive Predictive Coding (CPC) [12] is an effective technique for dealing with temporal data and has demonstrated promising results in time series, especially for human activity recognition (HAR) [3, 6]. For example, Haresamudram et al. [6] reported an \(F_1\)-score of 0.832 for the UCI dataset and 0.894 for MotionSense, reaching performance levels close to those attained by state-of-the-art deep models trained in a fully-supervised manner.
Although there is a vast amount of literature on SSL methods for HAR, few works explore the impact of pre-training datasets on the quality of representations learned by SSL models. This work aims to explore this gap by evaluating representations generated by CPC when trained with 15 different combinations of four publicly available HAR datasets - KuHar, UCI, MotionSense, and Real World (waist).
We quantitatively assessed performance using the \(F_1\)-score metric and conducted a qualitative analysis of data distribution using t-SNE [8]. Furthermore, we extended our analysis to low-data regimes by fine-tuning the model on the downstream task with varying percentages of labeled samples.
The main contributions of this work are: demonstrating that data variety during pre-training can significantly impact performance; showing that including the target dataset in pre-training improves performance; and showing that pre-training with all datasets produces a high-quality backbone, leading to downstream models that perform near the best across all tasks. Additionally, we showed that CPC pre-training provides substantial benefits in low-data regimes, achieving results comparable to using 100% of the data with only 5% of labeled samples.
The remainder of this paper is organized as follows: Sect. 2 discusses the related work. Section 3 introduces the key SSL terms used in this work and provides an overview of the CPC technique and its components. Section 4 presents the experimental results. Finally, Sect. 5 summarizes the main conclusions.
2 Related Work
Self-Supervised Learning (SSL) techniques aim to train a robust feature extractor, known as the backbone, which can later be combined with a prediction head (e.g., a classifier, regressor, etc.) to solve a target task (e.g., HAR) [1, 4].
In essence, this process involves two phases: pre-training and fine-tuning. In pre-training, a machine learning model is trained to solve a pretext task, i.e., a task that does not require labels but is designed to encourage the model to learn useful features, for instance, predicting the missing parts of an image. The model, also known as the “pretext model”, consists of a backbone followed by a projection head, which is usually a multi-layer perceptron, and is trained using a large amount of unlabeled data. In the second phase, the pre-trained backbone (Footnote 1) is combined with a prediction head to form the downstream model, which is trained in a supervised manner to solve the target (or downstream) task, for example, HAR. This phase usually requires a small amount of labeled data, which is used to fine-tune the model to the target task.
Recently, SSL has proved to be effective in HAR, outperforming fully-supervised training schemes. For instance, Saeed et al. [14] showed that, by using a pretext task based on transformation classification, they could improve the discrimination of the model by up to 4%, in comparison to fully-supervised and unsupervised methods (e.g., auto-encoders), in datasets like HHAR, UniMiB, UCI-HAR, WISDM, and MotionSense, using raw accelerometer and gyroscope data.
Haresamudram et al. [5] proposed an SSL technique using masked autoencoders, where a subset of the input data is randomly selected and perturbed. The pretext task is then to reconstruct the original input. The authors demonstrated that the model learns useful representations and achieves good results in HAR using the MotionSense and USC-HAD datasets.
Tang et al. [16] adapted SimCLR [2], a popular SSL technique based on contrastive learning, for HAR. This method uses two augmentations of the same sample to generate positive pairs and different samples to generate negative pairs. The authors reported a performance gain of over 2% using their pre-trained scheme. In a later work [15], the authors proposed SelfHAR, a teacher-student architecture for human activity recognition that combines semi-supervised and self-supervised learning, reporting a significant performance improvement across six datasets with different device placements.
Thukral et al. [17] proposed a teacher-student architecture for HAR, combining cross-modal self-supervised learning with semi-supervised learning. The teacher model, trained with augmented labeled data, generates pseudo-labels for the student model, which is trained with unlabeled data using the SimCLR technique. The student model can then be fine-tuned with a small amount of labeled data. Their method significantly improves performance, especially on the MobiAct dataset, in a few-shot learning scenario. However, they note that techniques like CPC excel in few-shot learning using the HHAR dataset for pre-training.
Among several SSL techniques, a technique that has recently stood out in the treatment of temporal signals is the Contrastive Predictive Coding (CPC), first introduced in 2018 by Oord et al. [12]. CPC aims to learn useful representations from temporal data by predicting future representations of the data. This technique will be further detailed in Sect. 3. The authors demonstrated that CPC can learn useful representations, achieving strong performance across four distinct domains: speech, images, text, and reinforcement learning. However, their work does not address the HAR task nor the impact that pre-training datasets have on the downstream model.
In 2021, Haresamudram et al. [6] adapted the CPC technique for HAR. The authors reported an improvement of up to 9% compared to Saeed et al. [14] in the MobiAct dataset and other improvements in MotionSense, UCI-HAR, and USC-HAD datasets.
Dhekane et al. [3] investigate the data requirements of CPC for HAR using the HHAR, PAMAP2, and RealWorld datasets. They randomly selected data from these datasets for representation learning with minimal annotations, applying data augmentation techniques like noise addition, scaling, and channel shuffling. Their work includes two evaluations: “Bottom-Up Evaluation”, determining the minimal amount of target domain data needed for useful representations, and “Transferred Assessment”, examining the minimal external data required for effective activity recognition in the target domain. Results show that CPC is highly data-efficient, needing only 15 and 5 min of pre-training data for the “Bottom-Up Evaluation” and “Transferred Assessment”, respectively, to achieve performance comparable to full-data pre-training. The authors conclude that target domain data is more beneficial for representation learning.
The choice of the dataset for pre-training can significantly impact downstream model performance. Qian et al. [13] examined various components in contrastive-based SSL techniques for HAR, including backbone architecture, augmentations, and contrastive pair construction. They also analyzed data-scarce cross-person generalization, where models are trained on a small subset of individuals and tested on a left-out individual. Results showed that generalization levels vary with the pre-training dataset. The authors suggest that self-supervised methods’ effectiveness can be dataset-dependent and that the choice of backbone architecture is crucial for model performance.
As SSL techniques have been shown to be effective in several scenarios, Table 1 classifies the related works according to the following criteria:
- CPC: evaluates the Contrastive Predictive Coding SSL technique;
- HAR: performs the evaluation on HAR;
- Multiple Pre-training sets: evaluates the impact of pre-training datasets on final model performance; and
- Few-shot learning: evaluates the model when trained with few labeled samples.
Notice that the closest related work is the work of Dhekane et al. [3], which evaluates CPC for HAR taking into account multiple pre-training sets and evaluating the model when trained with few labeled data. Nonetheless, our work differs from theirs in the following aspects: (i) we perform a qualitative assessment of the learned representations using t-SNE, (ii) we do not employ data augmentation and (iii) we evaluate the impact of different combinations of HAR datasets on the quality of the learned representations.
3 Contrastive Predictive Coding (CPC)
The CPC technique was first introduced by Oord et al. [12], and later adapted by Haresamudram et al. [6] for HAR. We rely on the implementation of Haresamudram et al. to perform our experiments, which is illustrated in Fig. 1.
The CPC backbone comprises two main components: the input encoder (\(g_{enc}\)), which encodes the input time series (x) into z, and an auto-regressor (\(g_{ar}\)), which encodes the initial time steps of z (i.e., z[0 : t]) into a representation called \(c_t\). CPC also contains several linear predictors (\(W_1\), ..., \(W_k\)), which are used to predict a subset of z (i.e., \(z[t+1:t+k]\)) from \(c_t\).
During training, a random time step t is chosen, dividing z into two segments: past (elements up to t) and future (elements after t). The time steps in the past segment (i.e., z[0 : t]) are encoded by \(g_{ar}\), generating the context vector \(c_t\). This vector is then used as input to several linear models (\(W_1\), \(W_2\), ..., \(W_k\)) to predict the next k time steps in z (i.e., \(z_{t+1}\), ..., \(z_{t+k}\)). In this case, these linear models work as projection heads.
CPC optimization aims to maximize the mutual information between \(c_t\) and \(z_{t+k}\), (with \(k>0\)). This is accomplished through the contrastive loss function InfoNCE [12], which increases the similarity between the positive pair (context and correct future representation) and decreases the similarity between negative pairs (context and incorrect future representations). Once the model is trained, the projection heads are discarded and the \(g_{enc}\) and \(g_{ar}\) models, i.e., the backbone, is attached to a prediction head to compose the downstream model.
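The InfoNCE objective can be sketched numerically as follows. This is a minimal numpy illustration for a single prediction step k: the projected context \(W_k c_t\) (here `pred`) should score higher against the true future representation \(z_{t+k}\) than against negatives drawn from other positions or sequences. The function and variable names are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def info_nce(pred, candidates, pos_index):
    """pred: (d,) projected context; candidates: (n, d) with one positive row.

    Returns the cross-entropy of identifying the positive among all candidates.
    """
    scores = candidates @ pred                 # dot-product similarities
    scores = scores - scores.max()             # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[pos_index]

rng = np.random.default_rng(0)
candidates = rng.normal(size=(8, 16))          # row 0 positive, rows 1-7 negatives
loss_random = info_nce(rng.normal(size=16), candidates, pos_index=0)
loss_aligned = info_nce(10 * candidates[0], candidates, pos_index=0)
```

A prediction aligned with the positive future representation yields a lower loss than a random prediction, which is precisely the behavior the contrastive objective rewards.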
In our experiments, we employed the same encoder (\(g_{enc}\)) and auto-regressor (\(g_{ar}\)) as Haresamudram et al. [6]: \(g_{enc}\) is composed of three blocks, each containing a 1D convolutional layer with 32, 64, and 128 channels, kernel size 3, ReLU activation function, and dropout of 0.2; and \(g_{ar}\) is composed of a two-layer Gated Recurrent Unit (GRU) with 256 units. This network uses a dynamic attention mechanism, allowing \(g_{ar}\) to capture subtle and relevant temporal features, even in long sequences, by considering distant samples less relevant.
For the downstream model, we used the same prediction head as Haresamudram et al., which consists of an MLP with three linear layers. Specifically, it includes a linear layer with 256 neurons, batch normalization, ReLU activation, and a 0.2 dropout. This is followed by another linear layer with 128 neurons, batch normalization, ReLU activation, and a 0.2 dropout. The final linear layer reduces the dimension to the number of classes.
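Under some assumed details not fully specified above (six input channels for accelerometer plus gyroscope, "same" padding in the convolutions, and omitting the attention mechanism), the backbone and prediction head can be sketched in PyTorch as:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """g_enc: three 1D-conv blocks with 32, 64, and 128 channels, kernel size 3."""
    def __init__(self, in_channels=6):
        super().__init__()
        layers, channels = [], [in_channels, 32, 64, 128]
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(), nn.Dropout(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                     # x: (batch, channels, time)
        return self.net(x).permute(0, 2, 1)   # -> (batch, time, 128)

class Backbone(nn.Module):
    """g_enc followed by g_ar (two-layer GRU with 256 units)."""
    def __init__(self, in_channels=6):
        super().__init__()
        self.g_enc = Encoder(in_channels)
        self.g_ar = nn.GRU(input_size=128, hidden_size=256,
                           num_layers=2, batch_first=True)

    def forward(self, x):
        z = self.g_enc(x)
        c, _ = self.g_ar(z)
        return c[:, -1, :]                    # context vector c_t: (batch, 256)

def prediction_head(n_classes):
    """MLP head: 256 -> 256 -> 128 -> n_classes, with batch norm and dropout."""
    return nn.Sequential(
        nn.Linear(256, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.2),
        nn.Linear(128, n_classes))
```

With a window of 50 time steps and 6 channels, `Backbone()` maps a batch of shape (B, 6, 50) to a context vector of shape (B, 256), which the head then maps to class logits.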
4 Experimental Results
This section details the experiments conducted to evaluate CPC’s effectiveness for HAR. Section 4.1 presents the materials, while Sect. 4.2 details the methods used. Section 4.3 shows validation experiments, reproducing literature results to ensure compatibility. Section 4.4 evaluates the impact of pre-training datasets on the downstream task for four target scenarios. Section 4.5 uses \(t\)-SNE to visualize sample distribution in the latent space and qualitatively analyze CPC’s learned representation. Finally, Sect. 4.6 assesses the impact of pre-training on downstream model performance with limited data.
4.1 Materials
Datasets: The experiments were conducted with four publicly available datasets for human activity recognition: UCI [7] (UCI), KuHAR [10] (KH), MotionSense [9] (MS), and RealWorld [11] (RW).
These datasets consist of time series data collected from smartphone accelerometers and gyroscopes. The smartphones were positioned differently across the datasets: in a waist bag for UCI and KH, in a pocket for MS, and at the waist for RW.
To explore the effectiveness of CPC in different pre-training scenarios, we generated all 15 combinations of the four aforementioned datasets as follows: using a single dataset, combining datasets two by two, combining three by three, and, finally, combining all four datasets.
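The 15 pre-training combinations enumerated above are simply the non-empty subsets of the four datasets (singletons, pairs, trios, and the full set), which can be generated as follows (dataset acronyms as defined above):

```python
from itertools import combinations

datasets = ["KH", "MS", "RW", "UCI"]

# All non-empty subsets: 4 singletons + 6 pairs + 4 trios + 1 full set = 15.
pretrain_sets = [combo
                 for r in range(1, len(datasets) + 1)
                 for combo in combinations(datasets, r)]
```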
The UCI dataset has 12 classes, 6 that represent continuous activities (e.g., walking), and 6 for transitions (e.g., sit to stand). For the pretext task, we employ all the samples; however, for the downstream task, only samples belonging to the continuous activities are used. This is performed to ensure we do not have to discard too much data when balancing the dataset for the supervised training. We distinguish these subsets by the acronyms UCI-12 and UCI-6, respectively.
The other datasets, i.e., KuHar, MotionSense, and RealWorld, remained the same in the downstream and pretext tasks. Table 2 summarizes the datasets’ characteristics.
To standardize and combine the datasets, several preprocessing steps were undertaken. First, we standardized the sampling rate to 50 Hz and converted the acceleration data to units of \(\frac{m}{{s}^{2}}\). The dataset was then partitioned based on user IDs, with 70% of users allocated to the training set, 10% to the validation set, and 20% to the test set, ensuring that no samples from a single user appeared in multiple subsets. Finally, the data was segmented into windows of 50 time steps (1 s) with a 50% overlap.
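Two of these preprocessing steps can be sketched as below, assuming the data has already been resampled to 50 Hz: the user-based 70/10/20 split and the segmentation into 50-step (1 s) windows with 50% overlap. The helper names and seed are illustrative, not from the paper's code.

```python
import numpy as np

def split_users(user_ids, seed=0):
    """Partition user IDs 70/10/20 so no user appears in more than one subset."""
    ids = list(np.random.default_rng(seed).permutation(user_ids))
    n = len(ids)
    return ids[:int(0.7 * n)], ids[int(0.7 * n):int(0.8 * n)], ids[int(0.8 * n):]

def sliding_windows(signal, size=50, overlap=0.5):
    """Segment signal (time, channels) into (n_windows, size, channels)."""
    step = int(size * (1 - overlap))           # 25 time steps for 50% overlap
    starts = range(0, signal.shape[0] - size + 1, step)
    return np.stack([signal[s:s + size] for s in starts])

train, val, test = split_users(list(range(30)))
windows = sliding_windows(np.zeros((200, 6)))  # 4 s of 6-channel data at 50 Hz
```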
Code: We adapted the code made available by Haresamudram et al. [6] (Footnote 2). The modified version of the code can be found at our GitHub repository (Footnote 3).
Hyperparameters: We also employ the same experimental setup described by Haresamudram et al. [6] to ensure reproducibility and comparability of results. Table 3 lists the hyperparameters used in this work.
4.2 Methods
Methodology: The CPC model, as described in Sect. 3, is first pre-trained with the unlabeled combinations of datasets, adjusting the parameters of the backbone, which is composed of the concatenation of the \(g_{enc}\) and \(g_{ar}\) models. Then, the downstream model is built by concatenating the pre-trained backbone to an untrained prediction head (an MLP with three linear layers), and trained using the training subset of the target dataset. However, in this process, the backbone weights are frozen and only the prediction head parameters are adjusted. Finally, we evaluate the downstream model on the test subset of the target dataset, reporting the \(F_1\)-score metric.
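The frozen-backbone fine-tuning described above can be sketched in PyTorch as follows; `backbone` and `head` are small illustrative stand-ins for the pre-trained \(g_{enc}\)+\(g_{ar}\) stack and the untrained MLP prediction head.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv1d(6, 128, 3, padding=1))  # stand-in for g_enc + g_ar
head = nn.Linear(128, 6)                                   # stand-in prediction head

# Freeze the backbone: its weights stay fixed during downstream training.
for p in backbone.parameters():
    p.requires_grad = False

# Only the prediction head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
```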
We also employ \(t\)-SNE to visualize the distribution of the samples on the latent space, i.e., using the features extracted by the backbone. This is performed using the training subset of the target datasets.
4.3 Validation
Initially, we validated our code base by reproducing state-of-the-art results. This involved verifying the effectiveness of CPC through direct comparison with results reported by Haresamudram et al. [6]. We employed the same hyperparameters, pre-training datasets, and target datasets: UCI-6, keeping the acceleration in units of G (i.e., without the unit standardization applied in the rest of our work), and MotionSense as described in Table 2. Table 4 shows the \(F_1\)-score reported by Haresamudram et al. and the ones produced in our own experiments. The results are very similar, indicating that the reproduction was successful and our implementation is a solid basis for subsequent experiments.
4.4 Impact of Pre-training Datasets on Downstream Tasks
We evaluate the impact that the datasets used in pre-training have on four target scenarios. The goal is to investigate whether the set of datasets used in the pre-training task significantly affects the performance of the downstream model. To this end, we pre-trained the CPC backbone using all possible combinations of the four original datasets: each dataset individually, in pairs, in trios, and, finally, all datasets combined.
The results are organized into tables, one for each target scenario, i.e., a target HAR dataset in the HAR task (downstream task). For example, Table 5 shows the results when using KuHar as the target dataset. In this case, each row shows which datasets were used on pre-training and the \(F_1\)-score of the downstream model when evaluated on the test subset of the KuHAR dataset. Tables 6, 7, and 8 exhibit the results when using MotionSense, RealWorld (waist), and UCI-6 as target datasets, respectively. The rows in Tables 5 through 8 are arranged in descending order based on the \(F_1\)-score.
The initial observation is that the \(F_1\)-score can vary significantly depending on the datasets used for pre-training: observe that the \(F_1\)-score varies by up to 12.9, 9.6, 10.5, and 13 percentage points on KuHar, MotionSense, RealWorld, and UCI-6, respectively. This indicates that the quality of the backbone model is sensitive to the choice of pre-training datasets.
We also observed that, in most cases, there is a significant drop in performance when the target dataset is not included in the pre-training set for the backbone. This effect is particularly pronounced in the KuHar and MotionSense datasets. For KuHar (Table 5), the \(F_1\)-score ranges from 0.770 to 0.789 when KuHar is part of the pre-training set, but it drops to below 0.712 when it is excluded. Similarly, in the MotionSense dataset, the \(F_1\)-score decreases by three percentage points when the target dataset is omitted from pre-training. A similar trend is observed for UCI-6 (Table 8), where the highest-performing configurations generally include the UCI dataset in the pre-training set.
We employed the Wilcoxon test to determine whether the performance differences observed when the target dataset is included in the pre-training set are statistically significant. Specifically, we compared the seven configurations that exclude the target dataset against the corresponding seven configurations that include it. The Wilcoxon test yielded a p-value of 0.015625 for the KuHar, MotionSense, and UCI-6 datasets, indicating that the observed differences are statistically significant.
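The reported p-value matches the exact two-sided value for this design: with n = 7 paired configurations and every "target included" score above its "target excluded" counterpart, the exact p-value is \(2/2^7 = 0.015625\). The sketch below illustrates this with scipy; the \(F_1\)-scores are illustrative numbers, not the paper's results.

```python
from scipy.stats import wilcoxon

# Seven paired pre-training configurations: with vs. without the target dataset.
included = [0.789, 0.781, 0.778, 0.775, 0.773, 0.771, 0.770]
excluded = [0.779, 0.761, 0.748, 0.735, 0.723, 0.711, 0.700]

# All seven differences are positive and distinct, so scipy uses the exact
# distribution: two-sided p = 2 / 2**7 = 0.015625.
stat, p_value = wilcoxon(included, excluded)
```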
The results for the RealWorld dataset deviate from this trend. As shown in Table 7, although the best-performing configuration includes RealWorld in the pre-training set, several other configurations, including the second-best one, achieve high \(F_1\)-score values without it. In this case, the Wilcoxon test produced a p-value of 0.296875, which is well above the 5% significance threshold. Therefore, we cannot conclude that including the RealWorld dataset in pre-training led to performance improvements.
It is worth noting that the performance of the RealWorld dataset generally improves when the KuHar dataset is included in the pre-training set. This improvement may be attributed to the fact that both datasets were collected with the device positioned at the waist and that the activity set of RealWorld is a subset of the KuHar activity set. We plan to explore this relationship further in future work.
Finally, we observed that the performance obtained when pre-training the backbone using the combination of the four datasets (KH+MS+RW+UCI-12) is very close or equal to the best \(F_1\)-score value achieved in the downstream task for all datasets, as evidenced in Table 9.
4.5 Qualitative Analysis of the Representations
In order to perform a qualitative analysis, we used t-distributed Stochastic Neighbor Embedding (\(t\)-SNE) [8] to visualize the distributions of the dataset samples on the original representation (i.e., time domain or raw data), and on the representation learned by the backbone when trained with the best combination of datasets. These combinations are indicated in the first row of Tables 5 to 8.
Figure 2a shows the \(t\)-SNE chart for two representations of the UCI dataset, the original one (UCI Raw), and the one learned by the CPC backbone (UCI + CPC). It is worth noting that the CPC backbone provided better class separation, with most activities forming well-defined clusters. However, there is some overlap between sitting and standing, both low-energy activities, and upstairs shows a few samples mixed with other activities. Even so, the overall separability remains clearly visible.
Figure 2b shows the \(t\)-SNE charts for MotionSense. Again, it indicates that the CPC backbone was able to provide a better separation of the classes, as the sitting and the standing samples are clearly separated from other samples.
Similarly, the CPC backbone was also able to provide a much better separation of the classes on the KuHar dataset, as illustrated in Fig. 3a: the 18 classes were heavily mixed in the raw data, whereas the CPC backbone yielded a much better separation, creating several homogeneous clusters.
Finally, the samples from the RealWorld dataset seem to already form clear activity clusters on the raw data, as illustrated in Fig. 3b. In this case, the CPC backbone did not offer much improvement, but at least it kept the same degree of clustering. Therefore, we conclude that, in general, the CPC backbone was able to improve the clustering of the samples or preserve the original clustering seen in the raw data.
4.6 Impact of Pre-training on Performance with Limited Data
SSL excels when training models on large datasets with minimal labeled data. In this experiment, we assess how the pre-training process affects the performance of the downstream model when trained with limited data, i.e., using fractions of the training set. We evaluate two versions of the downstream model on each dataset: one trained from scratch, without pre-training (named “From Scratch”), and the other utilizing backbone weights - \(g_{enc}\) (convolutional) + \(g_{ar}\) (GRU) - learned during pre-training. The pre-training is conducted with the CPC technique, considering the combination of the four datasets.
To evaluate the impact of the amount of labeled data on the downstream models, we train them using subsets of the training dataset. We use the following percentages of the training set: 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%. The data are shuffled before selecting the subsets; however, for a fair comparison, we use the same subset when training both downstream models. Finally, we assess the models’ performance by reporting the \(F_1\)-score using the test set of each dataset.
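The subsampling protocol above can be sketched as follows: shuffle the training indices once, then take growing prefixes, so that both downstream models see exactly the same samples at each percentage. The seed and function name are illustrative.

```python
import numpy as np

def fraction_subsets(n_samples, fractions, seed=42):
    """Shuffle indices once; each fraction is a prefix of the same permutation."""
    order = np.random.default_rng(seed).permutation(n_samples)
    return {f: order[:max(1, int(f * n_samples))] for f in fractions}

subsets = fraction_subsets(1000, [0.01, 0.05, 0.10, 0.50, 1.00])
```

Because each subset is a prefix of the same shuffled order, smaller fractions are nested inside larger ones, which keeps the comparison between models fair at every training-set size.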
Figure 4 shows the \(F_1\)-score for both models when trained with fractions of the training set for each one of the target datasets. The chart includes a blue dot representing the performance of a well-known model trained with supervised learning on the UCI dataset, as described in the literature. This model is a CNN reported by Xia et al. [18], using an architecture based on the work of Yang et al. [19].
The results demonstrate that the pre-training process offers substantial benefits for the downstream model across all datasets. Moreover, using the pre-training process, just 5% of the data was sufficient to achieve an \(F_1\)-score comparable to that obtained with 100% of the data. This highlights the effectiveness of pre-training with SSL over a fully-supervised training, particularly in scenarios with limited data. Furthermore, it is possible to observe that the downstream model pre-trained with CPC achieved nearly the same \(F_1\)-score as the convolutional network from the literature.
5 Conclusions
In this work we evaluated how the Contrastive Predictive Coding (CPC) SSL technique performs on HAR tasks. First, we assessed the impact of data variety on model pre-training using 15 combinations of four distinct HAR datasets. Our evaluation reveals that model performance can vary significantly based on the pre-training datasets, with \(F_1\)-score changes ranging from 9.6 to 13 percentage points across different target datasets (i.e., the ones used on the downstream task). It also shows that, in most cases, there is a noticeable difference in performance when the target dataset is included in the pre-training process. Moreover, we observed that using all four datasets during pre-training produced a high-quality backbone, leading to downstream models that perform very close to the best models on all target datasets. These results highlight the critical importance of selecting pre-training datasets aligned with the downstream task domain.
We then conducted a qualitative analysis of the backbone pre-trained with all four datasets and found that it generally extracted features that effectively enhanced the clustering of samples by their classes.
Finally, we assessed the impact of pre-training on the performance of the downstream model when trained with limited data, i.e., using only a fraction of the training data, and showed that the pre-training process provided significant benefits across all target datasets. This suggests that CPC enables the backbone to learn subtle and essential features of the problem domain. Additionally, with the pre-trained backbone, using just 5% of the data was sufficient to achieve an \(F_1\)-score comparable to that obtained with 100% of the data.
For future work, we suggest: (i) investigating how varying the amount of data used during pre-training impacts the learned representation, similar to the approach taken by Dhekane et al. [3]; and (ii) exploring whether including datasets with similar characteristics can enhance the performance of the backbone model.
All codes are open-source and available at https://github.com/H-IAAC/KR-Papers-Artifacts. The datasets can be made available upon request for the purpose of reproducibility.
Notes
- 1.
The projection head is usually discarded and it is not used on the downstream model.
- 2.
- 3.
References
Balestriero, R., et al.: A cookbook of self-supervised learning. arXiv:2304.12210 (2023)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning (2020)
Dhekane, S.G., Haresamudram, H., Thukral, M., Plötz, T.: How much unlabeled data is really needed for effective self-supervised human activity recognition? In: Proceedings of the 2023 ACM International Symposium on Wearable Computers (2023)
Ericsson, L., Gouk, H., Loy, C.C., Hospedales, T.M.: Self-supervised representation learning: introduction, advances, and challenges. IEEE Sig. Process. Mag. 39(3) (2022)
Haresamudram, H., et al.: Masked reconstruction based self-supervision for human activity recognition. In: ISWC 2020, pp. 45–49 (2020)
Haresamudram, H., Essa, I., Plötz, T.: Contrastive predictive coding for human activity recognition. In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 5, no. 2 (2021)
Reyes-Ortiz, J.L., Anguita, D.: Smartphone-based recognition of human activities and postural transitions (2015)
Van der Maaten, L., Hinton, G.: Visualizing data using T-SNE. J. Mach. Learn. Res. 9(11) (2008)
Malekzadeh, M.: MotionSense dataset: smartphone sensor data - HAR
Nahid, A.A., Sikder, N., Rafi, I.: KU-HAR: an open dataset for human activity recognition. Mendeley Data (2021)
DataSet - RealWorld. University of Mannheim
Oord, A.V.D., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv:1807.03748 (2018)
Qian, H., Tian, T., Miao, C.: What makes good contrastive learning on small-scale wearable-based tasks? In: Zhang, A., Rangwala, H. (eds.) KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022)
Saeed, A., Ozcelebi, T., Lukkien, J.: Multi-task self-supervised learning for human activity detection. Proc. ACM Interact. Mob. Wearable Ubiquit. Technol. 3(2) (2019)
Tang, C.I., Perez-Pozuelo, I., Spathis, D., Brage, S., Wareham, N., Mascolo, C.: SelfHAR: improving human activity recognition through self-training with unlabeled data. Proc. ACM Interact. Mob. Wearable Ubiquit. Technol. 5(1) (2021)
Tang, C.I., Perez-Pozuelo, I., Spathis, D., Mascolo, C.: Exploring contrastive learning in human activity recognition for healthcare. arXiv:2011.11542 (2020)
Thukral, M., Haresamudram, H., Ploetz, T.: Cross-domain HAR: few shot transfer learning for human activity recognition. arXiv:2310.14390 (2023)
Xia, K., Huang, J., Wang, H.: LSTM-CNN architecture for human activity recognition. IEEE Access 8 (2020)
Yang, J., Nguyen, M.N., San, P.P.: Deep convolutional neural networks on multichannel time series for human activity recognition (2015)
Acknowledgments
This project was supported by the Ministry of Science, Technology, and Innovation of Brazil, with resources granted by the Federal Law 8.248 of October 23, 1991, under the PPI-Softex [01245.003479/2024-10]. The authors also thank CNPq (315399/2023-6 and 404087/2021-3) and Fapesp (2013/08293-7) for their financial support.
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
da Silva, B.E.R., Napoli, O.O., Delgado, J.V., Rocha, A.R., Boccato, L., Borin, E. (2025). Impact of Pre-training Datasets on Human Activity Recognition with Contrastive Predictive Coding. In: Paes, A., Verri, F.A.N. (eds) Intelligent Systems. BRACIS 2024. Lecture Notes in Computer Science(), vol 15414. Springer, Cham. https://doi.org/10.1007/978-3-031-79035-5_21