1 Introduction

Personalized recommendations have become an increasingly common practice in our digital lives [5]. With the popularization of technology and ease of making information available, users have access to many items such as movies, music, and news, making it challenging for them to discover everything available and explore their interests. Recommender systems then emerged in the 1990s to overcome this problem, employing filtering techniques to direct the user to a restricted subset of relevant items [2].

Among several types of recommender systems, collaborative filtering is currently the most prolific in scientific literature and practical applications. Their popularity can be credited to its ease of application, since it depends only on a matrix of interactions between users and items to generate recommendations based on similar tastes among users [5].

In early studies, the general approach uses item-based neighborhood algorithms for generating the recommendation [7]. However, the representation commonly adopted grows fast in size as the number of users and items increases; in real-world scenarios, these numbers grow constantly and much faster than the number of user-item interactions. Consequently, the traditional representation model became highly sparse and with an inflated dimensionality, demanding more and more computational power and reducing the quality of the results [18].

Many approaches have been presented to overcome this problem, such as matrix factorization [20] and neural models [36]. In the latter, methods based on neural embeddings have been gaining ground in recent literature. These methods aim to represent items or users as distributed vectors, resulting in a representation with reduced dimensionality while keeping its latent meaning [4, 13].

Although many studies have proposed different methods based on neural embeddings [11, 31], most of these studies do not consider the moment when the user interacted with the item [6]. The static approach – as it is known – while very present in the literature, has proven flaws [21]. Taking into account the timing and order of interactions can lead to significant improvements in results [3, 6, 33].

In this context, this study shows how to adapt Item2Vec [4] – a pioneering method of neural embeddings for a recommendation context – to consider temporal information during training by outlining an approach based on a sliding window over time, a strategy commonly employed in similar scenarios [33].

2 Related Work

Neighborhood algorithms became very popular in the first collaborative filtering recommender systems, given their relative low complexity and ability to produce satisfying results. Early neighborhood algorithms represented users as vectors and yielded recommendations based on the interactions history of the users closest to the target user [7]. Subsequent approaches began to represent items as vectors [28]. This strategy achieved better results, the training was more efficient, and the results were more stable over time having less need for retraining [17].

Gradually, the data used by recommender systems has grown, bringing new challenges to the area such as high sparsity and dimensionality of the commonly adopted representation [18]. Many algorithms for dimensionality reduction have been proposed to overcome this problem. Among different strategies, matrix factorization techniques have become state-of-the-art and are still very relevant to this day [25]. In this approach, users and items are represented as matrices of reduced dimensionality, and the recommendation is made using user-item similarities [20].

With the growth in research of neural networks, the use of neural models for recommender systems has increased significantly in recent years. Among several proposals for using neural networks for recommendation, neural embeddings have been gaining ground in the literature [13]. Inspired by Natural Language Processing (NLP) neural models capable of generating vector representations of words [24], these models were adapted to learn low-dimensional embeddings for items and users [27].

Grbovic et al. [13] addressed an ad recommendation problem based on receipts collected by email and proposed the Prod2Vec and User2Vec methods. The former is based on a skip-gram architecture [24] learning product embeddings and recommending through item similarity, while the latter is based on Paragraph Vector [22], learning embeddings of users and items concurrently and recommending through the similarities between users and items.

With an approach similar to Prod2Vec, Barkan & Koenigstein [4] proposed Item2Vec in the following year. The method was also inspired by consolidated techniques from the NLP area [24] and achieved promising results, surpassing state-of-the-art matrix factorization methods in different tasks.

In the ensuing years, different strategies for learning neural embeddings in recommender systems were proposed, such as (i) adapting existing models to leverage item metadata [11, 31]; (ii) using content information during the embeddings training phase [14, 35]; (iii) employing more complex neural models, e.g., LSTMs or CNNs [29, 30]; (iv) training unaltered NLP models over textual data of items, users, or interactions [9, 15]; and (v) considering different types of user-item interaction [37].

Few studies on neural embeddings in recommender systems explored the impact of using temporal information, i.e., when the interaction happened. The most common approach is session-based recommendation, in which only items consumed by the user in a short period are used for training the embeddings, considering an ordered relationship [12] or not [13].

The static approach is still customary in recommender systems as a whole, even though it has flaws [21]. User’s interests change over time, and by not accounting for it, important information can be lost [6, 33]. It is known that considering temporal aspects tends to improve the quality of the recommendation and generate greater confidence for the user [3]. Additionally, static algorithms trained in offline scenarios can suffer a drop in performance when applied in online and real scenarios with temporal properties [21]

Recommender systems capable of learning relevant information from temporal data have been proposed since the beginning of the field [1], but this has gained relevance when timeSVD++ [19], a temporal adaptation of the matrix factorization method SVD++, won one of the rounds of The Netflix PrizeFootnote 1.

Two approaches can be applied to temporal information: using time as context, based on the idea of cyclical phenomena as it is done in Time-Aware Recommender Systems (TARS); or use time as a sequence, viewing temporal information as a chronologically ordered sequence, the strategy used in Time-Dependent Recommender Systems (TDRS) [33].

In TDRS two data interpretation methodologies are commonly adopted: the use of decay functions, which make newer interactions have more weight during the learning phase than older interactions [10]; and the use of sliding windows (also known as “forgetting techniques”) in which interactions are sorted chronologically, and only a slice of the data is consumed during training, with the remaining information receiving less weight or being ignored [33].

In the last decade, some studies showed promising results using sliding window strategies for recommender systems. For instance, Vinagre & Jorge [32] employed two different sliding window techniques in a neighborhood-based recommender. Matuszyk et al. [23] proposed five forgetting strategies for matrix factorization using temporal windows or fading factors during data preprocessing and model training. Wang et al. [34] trained a neural network to predict shopping baskets in an e-commerce environment using a unit-size sliding window over past baskets. All studies above surpassed baseline algorithms or improved scalability without compromising accuracy.

In line with the positive results that have been presented in the literature, we propose SeqI2V, a sequential Item2Vec [4] adaptation using sliding windows over time. In our proposal, the embeddings learning phase is influenced by when interactions occur, increasing the context these item representations carry. The final recommendation is statically computed using an item-based neighborhood approach, just like the original method.

3 The Proposed Method

Methods based on neural embeddings are becoming state-of-the-art for recommender systems, presenting results comparable to established approaches, such as neighborhood techniques and matrix factorization. The Item2Vec was a pioneering technique in introducing neural embedding techniques from Natural Language Processing to the recommendation scenario, adapting the skip-gram network to generate item embeddings. As stated by the authors, the method outperformed the well-known SVD in different evaluated tasks.

Some subsequent studies adapted Item2Vec to specific scenarios or improved the embeddings training phase [11, 31]. However, temporal information is still underexplored, limiting the application of Item2Vec to offline scenarios, in which pattern variations over time are ignored. To fill this important gap, we propose using sliding windows over time to introduce temporal information in the learning stage while keeping the elegant simplicity of Item2Vec.

In the following, we explain the traditional Item2Vec model and present SeqI2V – a sequential Item2Vec adaptation using sliding windows over time.

3.1 Item2Vec

Item2Vec is an artificial neural network composed of three layers: an input, a hidden and an output layer, as shown in Fig. 1.

Given a recommender system composed of a set U of users, a set I of items and multiple sets \(I_u\) of items consumed by user u, the input layer \(\mathcal {J}\) has a size |I|, as is comprised of items i encoded as one-hot vectors (all elements in the array have a value of 0, except for the one that represents the item, which has a content value of 1). The middle layer \(\mathcal {H}\) has size M, which corresponds to the desired dimensionality for the final vector representation of each item. Between the input and the middle layers, there is a weight matrix \(\mathcal {W}\), of size \(|I| \times M\), used to activate the initial layer through the operation \(\mathcal {H} = \mathcal {J} \cdot \mathcal {W}\). The output layer \(\mathcal {O}\) is made up of \(|I_u|-1\) vectors of size |I| connected to the middle layer by the transpose of the same weight matrix \(\mathcal {W}\) and a softmax activation function. Thus, \(\mathcal {O} = \sigma (\mathcal {H} \cdot \mathcal {W}^T)\), where \(\sigma \) represents the softmax function shown in Eq. 1.

$$\begin{aligned} \sigma (u, i) = \frac{e^{(\mathcal {W}_u \mathcal {W}^T_i)}}{\sum _{j \in I}{e^{(\mathcal {W}_u \mathcal {W}^T_j)}}} \end{aligned}$$
(1)
Fig. 1.
figure 1

Standard Item2Vec architecture

The network is trained over multiple epochs. For each epoch, each item \(i \in I_u\) for every user u is used as input. In the output layer \(\mathcal {O}\), we have \(|I_u|-1\) one-hot encoded vectors (the number of items consumed by the user removing the target item), each associated with an item \(j \in I_u \mid j \ne i\). In other words, the network receives an item and must predict another item consumed by the user, approximating them in the vector space. For every pair of input and output items, the network calculates the prediction error and updates the weights. The main goal of the network is to maximize the objective function given by Eq. 2.

$$\begin{aligned} \frac{1}{|U|}\sum _{u\in U}\sum _{i\in I_u} \sum _{\overset{j\in I_u}{j\ne i}} log \; \sigma (i, j) \end{aligned}$$
(2)

At the end of the learning phase, the weight matrix \(\mathcal {W}\) will contain dense vector representations of dimensionality M for every item processed.

Item2Vec come with some drawbacks that would be impractical in scenarios with many items, since it must predict a related item (when output is 1) and all unrelated items (when output is 0) for each target item. Thus, two strategies commonly used in NLP are adopted [24]: (i) negative sampling, i.e., the error is calculated over a small subset of N negative items, randomly drawn so that frequent items are more likely to be selected according to Eq. 3; and (ii) subsampling of frequent items, i.e., a subset of the interactions of popular items are discarded with a probability of being kept shown in Eq. 4. In both equations, function z(i) returns the frequency of item i, and the hyperparameters \(\gamma \) and \(\rho \) can be fine-tuned.

$$\begin{aligned} P(i \mid \gamma ) = \frac{z(i)^\gamma }{\sum _{j \in I}{z(j)^\gamma }} \end{aligned}$$
(3)
$$\begin{aligned} P(i \mid \rho ) = \left( \sqrt{\frac{z(i)}{\rho }} + 1 \right) \cdot \frac{\rho }{z(i)} \end{aligned}$$
(4)

3.2 Sequential Item2Vec

Standard Item2Vec assumes that all items consumed by a user are equally related, shuffling them at the beginning of training. Therefore, given a particular item, the network considers that another item consumed in the distant past (e.g., years ago) has the same similarity to the target item as an item consumed in the recent past (e.g., same day). In other words, all other items are considered contextually equal to the target item, as illustrated in Fig. 3 for the example shown in Fig. 2.

Fig. 2.
figure 2

User interactions in a chronological order.

We have adopted a sliding window of fixed and parameterizable size during training to restrict the context. In this way, only items consumed in sequence are considered to be related to the target item. In a first approach, we are not interested in the moment the interaction occurred, but only in the chronological order, as shown in Fig. 4.

The neural architecture for SeqI2V is very similar to Item2Vec, the only difference taking place in the number of vectors present in the output layer: instead of a vector for each of the other \(|I_u|-1\) items consumed by the user, the network must predict only S items, where S corresponds to the window size (Fig. 5).

Fig. 3.
figure 3

Standard Item2Vec. Every item consumed by the user is equally related to the target item.

Fig. 4.
figure 4

Temporal sliding window during training.

Fig. 5.
figure 5

Sequential Item2Vec architecture

Although this model considers the chronological order in which items are consumed, it fails to capture changes in users who spend long periods without interacting with the system. To overcome this problem, we propose splitting a user’s interaction history into various chronological sequences using absolute and pre-defined time intervals to decide where to split. This approach allows that when two continuous interactions of the same user have an interval greater than D days, their chronological sequence is then split into two. As we are only interested in the relationship between items, the network would interpret this single-user as different ones, restricting learning to its behavior in a defined time interval, as illustrated by Fig. 6. The value of the parameter controlling the number of days to define the binding time window is flexible and can be adjusted for each specific application scenario a user might want to use SeqI2V on. In this work, we opt for an absolute value of days when splitting as it is a less complex and humanly understandable strategy.

Fig. 6.
figure 6

Sequential Item2Vec. The user interaction history is split to capture the changes in its behavior.

During the learning phase and when generating the final recommendation, SeqI2V also employs all techniques used by standard Item2Vec, such as negative sampling and item subsampling during training, and also an item-based K-Nearest Neighbor when recommending.

4 Experiments and Results

In order to make the experiments reproducible, this section describes the experimental protocol conducted during the study. First, we introduce the datasets used, in conjunction with a description of how the training was carried out. Then, we present the achieved results.

4.1 Datasets

The datasets used in the experiment are listed in Table 1. All of them are publicly available and were selected based on their relevance within literature. DeliciousBookmarks and MovieLens belong to GroupLens repositoryFootnote 2; BestBuyFootnote 3 and NetflixPrizeFootnote 4 were datasets used in challenges conducted by, respectively, Association for Computing Machinery (ACM) and Netflix, Inc.; CiaoDVDFootnote 5 was crawled from Ciao website; RetailRocket is the database of the retention management platform with the same nameFootnote 6. In all these datasets, we removed users or items with a single interaction since most of the trained algorithms aim to learn relations between same-user or item interactions.

4.2 Experimental Protocol

We sorted the interactions chronologically in each dataset and separated them into a train, validation, and test partitions, with an 8:1:1 ratio, respectively. Then, we optimized the hyperparameters using grid search on the validation partition, aiming to maximize NDCG in a top-15 recommendation scenario. Finally, the final results were obtained on the test partition. Due to hardware limitations, we performed the parameter optimization/final experiment of larger datasets over a small and randomly comprised subset: 25%/100% for MovieLens and 5%/25% for NetflixPrize. Additionally, we removed cold-start cases, i.e., users or items with no interaction on the training set, since this is a problem beyond the scope of our study.

Table 1. Description of the datasets in terms of number of users (|U|), number of items (|I|), number of interactions (|R|), sparsity (E), and application domain.

Although our main goal is to improve the results of Item2Vec, we also compared the performance achieved by the proposed Sequential Item2Vec (SeqI2V) with classical recommender algorithms. For this, we have implemented the established item-based version of K-Nearest Neighbors (KNN) [28] and two state-of-the-art matrix factorization methods: implicit singular value decomposition (ALS) [16] and Bayesian Personalized Ranking (BPR) [26]. In addition to the standard Item2Vec (I2V), we have also implemented User2Vec (U2V) [13], a user-item neural embeddings-method.

For KNN we used the implementation of turicreate library, without requiring hyperparameter choices. Both matrix factorization methods were implemented using implicit library, with latent factors ranging in \(\{50, 100, 200\}\), regularization factor in \(\{10^{-6}, 10^{-4}, 10^{-2}\}\), number of epochs in \(\{10, 20, 50, 100\}\), and learning rate in \(\{0.0025, 0.025, 0.25\}\). All neural embeddings models, including Sequential Item2Vec, were implemented using gensim library, with hyperparameter being tuned according insights proposed by Caselles-Duprés et al. [8]. Thus, the dimensionality M was fixed in 100 and 5 negative samples were randomly drawn for each item, with the number of epochs ranging in \(\{50, 100, 150\}\), negative exponent for the negative sampling in \(\{-1.0, -0.5, 0.5, 1.0\}\), and the subsampling factor in \(\{10^{-5}, 10^{-4}, 10^{-3}\}\). Additionally, for the sequential adaptation, we tested different window sizes ranging in \(\{1,3,5,7,9\}\), and the interval of days for splitting in \(\{182, 365, 730, \text {no split}\}\).

Table 2. F1-Score obtained in a top-15 recommendation problem.
Table 3. NDCG obtained in a top-15 recommendation problem.

4.3 Results

We evaluated the algorithms in a top-15 recommendation task, i.e. for each user, the methods must recommend 15 items in a ranked list. Then, we calculated the F1-Score (Table 2) and the Normalized Discounted Cumulative Gain, or NDCG (Table 3). All results are in grayscale, with darker tones representing better performances. The best value for each dataset is highlighted in bold.

To facilitate comparing the SeqI2V with its baseline (Item2Vec), Table 4 presents the improvements and worsenings resulting from the use of SeqI2V concerning the results obtained by Item2Vec in each database and performance measure. In both metrics, the results obtained by Sequential Item2Vec are promising. In all datasets, except for BestBuy, the SeqI2V outperformed the I2V. On the RetailRocket dataset, the performance of SeqI2V was approximately 150% higher. The F1-Score achieved by SeqI2V was 52% and 43% higher than Item2Vec, in average and median, respectively. Moreover, the NDCG achieved by SeqI2V was 48% and 55% higher, in average and median.

When compared to the other established recommender algorithms, the results of Sequential Item2Vec remain auspicious. For the F1-Score, it attained the best result in 66% of the datasets. On the other hand, as BPR optimizes the ranking, it achieved the best performance when NDCG is used as an evaluation metric. Still, for both metrics, SeqI2V was the best or second-best in 83% of the cases.

It is important to highlight that regarding the number of days used to split the temporal sequences, RetailRocket was the only dataset in which sequences with no split achieved the best performance. For all the others datasets, the best performance was attained using sequences split between 182 and 730 days. These results evidence the effectiveness of splitting user sequences and the need to adapt the interval for each specific application. Moreover, even when the split was not necessary, using sliding windows significantly improved the results.

It is also important to highlight that, although SeqI2V demands an extra step of data preparation, training the embeddings is less costly than in Item2Vec. For each target item, it makes predictions only for a reduced set of items (the window size) instead of all items previously consumed by the user. Thus, the proposed method is more computationally efficient and presents better results without increasing implementation complexity.

Table 4. Improvements and worsenings provided by Sequential Item2Vec compared to Item2Vec.

The main limitation of the proposed method relies on exceptional cases in which the user always takes a long period between two interactions, greater than the defined window size. In this case, consumption contexts would be composed of unitary relationships, allowing sliding windows but making the interactions sequence splitting unfeasible.

5 Conclusion

Item2Vec is a well-known method capable of learning neural embeddings for items in a recommender system. This study presented Sequential Item2Vec – an adaptation of the original approach that uses a sliding window over time, giving temporal context awareness for the item embeddings. We also proposed splitting the user’s interaction history into multiple sequences so that the network would not learn relationships between items consumed in considerably different moments, thus capturing changes in the user’s behavior.

In order to assess whether the proposed approach was able to improve the results obtained with the standard Item2Vec, we compared both models with other established methods in the field, over six well-known and publicly available databases. Metrics of hit and ranking were calculated in a top-15 recommendation scenario. The results indicate that Sequential Item2Vec outperformed the results obtained by Item2Vec. It was superior to its predecessor in almost all of the datasets. The proposed approach was also competitive compared to the other methods, matching the Bayesian Personalized Ranking, a state-of-the-art matrix factorization method.

Although giving time awareness to the item embeddings, the recommendation generated by Sequential Item2Vec is still statically. In future research, we aim to consider time in the recommendation step, analyzing its impact. Additionally, we suggest studying other ways to include time during the learning phase. One promising approach can be the use of decay functions instead of - or in conjunction with - sliding windows.