key: cord-0582638-m7w7aue8
authors: Wu, Liwei
title: Advances in Collaborative Filtering and Ranking
date: 2020-02-27
journal: nan
DOI: nan
sha: c405c06f34d9d5a59b5e30404db7f429432ed8d7
doc_id: 582638
cord_uid: m7w7aue8

In this dissertation, we cover some recent advances in collaborative filtering and ranking. Chapter 1 gives a brief introduction to the history and current landscape of collaborative filtering and ranking. Chapter 2 discusses the pointwise collaborative filtering problem with graph information and shows how our proposed method can encode very deep graph information, which helps four existing graph collaborative filtering algorithms. Chapter 3 covers the pairwise approach for collaborative ranking and how we speed up the algorithm to near-linear time complexity. Chapter 4 presents the new listwise approach for collaborative ranking and shows that the listwise loss is a better choice than pointwise and pairwise losses for both explicit and implicit feedback. Chapter 5 describes Stochastic Shared Embeddings (SSE), the new regularization technique we proposed for embedding layers, and shows that it is both theoretically sound and empirically effective for 6 different tasks across recommendation and natural language processing. Chapter 6 shows how we introduce personalization into the state-of-the-art sequential recommendation model with the help of SSE, which plays an important role in preventing our personalized model from overfitting to the training data. Chapter 7 summarizes what we have achieved so far and predicts possible future directions. Chapter 8 is the appendix to all the chapters.

List of Tables

We did not report SSE-PT++ results for beauty, games and steam, as the input sequence lengths are very short (see Table 8

The synthetic dataset has 10,000 users and 2,000 items, with a user friendship graph of size 10,000 × 10,000. Note that the graph only contains at most 6-hop valid information. GRMF $G^6$ means GRMF with $G + \alpha \cdot G^2 + \beta \cdot G^3 + \gamma \cdot G^4 + \cdot\, G^5 + \omega \cdot G^6$. GRMF DNA-$d$ means depth $d$ is used.


Nowadays, in online retail and online content delivery applications, it is commonplace to have embedded recommendation system algorithms that recommend items to users based on previous user behaviors and ratings. The field of recommender systems has steadily gained popularity ever since the famous Netflix competition [7], in which competitors used user ratings to predict ratings for each user-movie pair and the final winner took home 1 million dollars. During the competition, two distinct approaches stood out: Restricted Boltzmann Machines [93] and matrix factorization [61, 73].

The combination of both approaches worked well during the competition, but owing to their ease of training and inference, matrix factorization approaches dominated the collaborative filtering field before the widespread adoption of deep learning methods [95]. Collaborative filtering refers to making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). Usually this does not approach the recommendation problem as a ranking problem but rather treats it as a regression or classification problem. Of course, some loss may be incurred when using a regression or classification loss for what is ultimately a ranking problem, because at the end of the day the ordering is what matters most and is directly associated with recommender system performance. Collaborative ranking approaches mitigate this concern by using a ranking loss. The ranking loss can be either [8, 89] and temporal information [43, 56]. The ways of constructing graphs can vary. One way to define a graph over users is to exploit the friendship relationships among them, but there are many other ways to define such graphs. If there is more than one type of relationship, we call it a knowledge graph instead of a graph. Knowledge graphs are widespread: for example, two movies could star the same actors or belong to the same genre. In the field of collaborative filtering List-wise Approach [50, 121]

This dissertation summarizes several published and under-review works that advance the field of collaborative filtering and ranking. Our first main contribution is that we fill the void in [115]. We contribute not only to the fundamental approaches but also to extended approaches that utilize extra information, including graph and temporal ordering information. On one hand, we propose a novel way to encode long-range graph interactions without requiring any training, using Bloom filters as the backbone [119]. On the other hand, with the help of a new embedding-layer regularization called Stochastic Shared Embeddings (SSE) [117], we introduce personalization for the state-of-the-art sequential recommendation model and achieve much better ranking performance with our personalized model [118]; unlike in most natural language tasks, personalization is crucial for the success of recommender systems. This new regularization not only helps existing collaborative filtering and collaborative ranking algorithms but also benefits natural language processing methods in areas such as machine translation and sentiment analysis [117].

In this chapter, we consider the Collaborative Ranking (CR) problem for recommendation systems. Given a set of pairwise preferences between items for each user, collaborative ranking can be used to rank un-rated items for each user, and this ranking can be naturally used for recommendation. It is observed that collaborative ranking algorithms usually achieve better performance since they directly minimize the ranking loss; however, they are rarely used in practice due to the poor scalability. 

In this chapter, we propose a listwise approach for constructing user-specific rankings in recommendation systems in a collaborative fashion. We contrast the listwise approach to previous pointwise and pairwise approaches, which are based on treating either each rating or each pairwise comparison as an independent instance respectively. By extending the work of [15] , we cast listwise collaborative ranking as maximum likelihood under a permutation model which applies probability mass to permutations based on a low rank latent score matrix. We present a novel algorithm called SQL-Rank, which can accommodate ties and missing data and can run in linear time. We develop a theoretical framework for analyzing listwise ranking methods based on a novel representation theory for the permutation model. 

Collaborative Filtering with Graph Encoding

Recommendation systems are increasingly prevalent due to content delivery platforms, e-commerce websites, and mobile apps [98] . Classical collaborative filtering algorithms use matrix factorization to identify latent features that describe the user preferences and item meta-topics from partially observed ratings [61] . In addition to rating information, many real-world recommendation datasets also have a wealth of side information in the form of graphs, and incorporating this information often leads to performance gains [67, 89, 128] .

However, each of these only utilizes the immediate neighborhood information of each node in the side information graph. More recently, [8] incorporated graph information when learning features with a Graph Convolution Network (GCN) based recommendation algorithm. GCNs [58] constitute flexible methods for incorporating graph structure beyond first-order neighborhoods, but their training complexity typically scales rapidly with the depth, even with sub-sampling techniques [18] . Intuitively, exploiting higher-order neighborhood information could benefit the generalization performance, especially when the graph is sparse, which is usually the case in practice. The main caveat of exploiting higher-order graph information is the high computational and memory cost when computing higher-order neighbors since the number of t-hop neighbors typically grows exponentially with t.

We aim to utilize higher-order graph information without introducing much computational and memory overhead. We propose a Graph Deep Neighborhood Aware (Graph DNA) encoding, which approximately captures the higher-order neighborhood information of each node via Bloom filters [9]. Bloom filters encode neighborhood sets as $c$-dimensional 0/1 vectors, where $c = O(\log n)$ for a graph with $n$ nodes, which approximately preserves membership information. This encoding can then be combined with either graph-regularized or feature-based collaborative filtering algorithms, with little computational and memory overhead. In addition to computational speedups, we find that Graph DNA achieves better performance than competitors. We show that our Graph DNA encoding can be used with several collaborative filtering algorithms: graph-regularized matrix factorization with explicit and implicit feedback [89, 128], co-factoring [67], and GCN-based recommendation systems [74]. In some cases, using information from deeper neighborhoods (like 4th order) yields a 15x increase in performance, with Graph DNA encoding yielding a 6x speedup compared to directly using the 4th power of the graph adjacency matrix.

Matrix factorization has been used extensively in recommendation systems with both explicit [61] and implicit [49] feedback. Such methods compute low dimensional user and item representations; their inner product approximates the observed (or to be predicted) entry in the target matrix. To incorporate graph side information in these systems, [89, 128] used a graph Laplacian based regularization framework that forces a pair of node representations to be similar if they are connected via an edge in the graph. In [126] , this was extended to the implicit feedback setting. [67] proposed a method that incorporates first-order information of the rating bipartite graph into the model by considering item co-occurrences. More recently, GC-MC [8] used a GCN approach performing convolutions on the main bipartite graph by treating the first-order side graph information as features, and [74] proposed combining GCNs and RNNs for the same task.

Methods that use higher-order graph information are typically based on taking random walks on the graphs [31]. [52] extended this method to include graph side information in the model. Finally, the PageRank [76] algorithm can be seen as computing the steady-state distribution of a Markov network, and similar methods for recommender systems were proposed in [1, 122].

For a complete list of related works on representation learning on graphs, we refer the interested reader to [36]. For the collaborative filtering setting, [8, 74] use Graph Convolutional Neural Networks (GCN) [23], but with some modifications. Standard GCN methods, including well-known approaches like GCN [58] and GraphSage [35], cannot be directly applied to collaborative filtering rating datasets without substantial modifications, because they are intended to solve semi-supervised classification problems over graphs with node features. PinSage [123] extends GraphSage to non-personalized graph-based recommendation but is not meant for collaborative filtering problems. GC-MC [8] extends GCN to collaborative filtering, although it is less scalable than [123].

Our Graph DNA scheme can be used to obtain graph features in these extensions. In contrast to the above-mentioned methods involving GCNs, we do not use the data driven loss function to train our graph encoder. This property makes our graph DNA suitable for both transductive as well as inductive problems.

Bloom filters have been used in Machine Learning for multi-label classification [21] , and for hashing deep neural network models representations [22, 37, 99] . However, to the best of our knowledge, until now, they have not been used to encode graphs, nor has this encoding been applied to recommender systems. So it would be interesting to extend our work to other recommender systems settings, such as [118] and [117] .

We consider the recommender system problem with a partially observed rating matrix R and a Graph that encodes side information G. In this section, we will introduce the Graph DNA algorithm for encoding deep neighborhood information in G. In the next section, we will show how this encoded information can be applied to various graph based recommender systems.

The Bloom filter [9] is a probabilistic data structure designed to represent a set of elements. Thanks to their space-efficiency and simplicity, Bloom filters are applied in many real-world applications such as database systems [10, 16]. A Bloom filter $B$ consists of $k$ independent hash functions $h_t(x) \to \{1, \dots, c\}$. The Bloom filter $B$ of size $c$ can be represented as a length-$c$ bit-array $b$. More details about Bloom filters can be found in [12]. Here we highlight a few desirable properties of Bloom filters essential to our graph DNA encoding:

is the shortest path distance between nodes $i$ and $j$ in $G$. As the last step, we stack the array representations of all Bloom filters and form a sparse matrix $B \in \{0, 1\}^{n \times c}$, where the $i$-th row of $B$ is the bit representation of $B[i]$. As a practical measure, to prevent over-saturation of Bloom filters for popular nodes in the graph, we add a hyper-parameter $\theta$ to control the maximum saturation level allowed for Bloom filters. This also prevents hub nodes from dominating the graph DNA encoding. The pseudo-code for the proposed encoding algorithm is given in Algorithm 1. We use graph DNA-$d$ to denote the graph encoding obtained after applying Algorithm 1 with $s$ looping from 1 to $d$. We also give a simple example to illustrate how the graph DNA is encoded into Bloom filter representations in
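The encoding described above can be sketched in a few lines of Python. This is a minimal illustration and not a transcription of Algorithm 1: the helper names (`bloom_mask`, `graph_dna`, `maybe_contains`), the SHA-256-based hash family, the synchronous per-round update, and the interpretation of $\theta$ as a maximum fraction of set bits are all assumptions made for the example. Each Bloom filter is stored as a Python integer so that merging two filters is a single bitwise OR.

```python
import hashlib


def bloom_mask(item, c, k):
    """Return an int bitmask with the k hashed positions of `item` set (c bits, 0-based)."""
    mask = 0
    for t in range(k):
        # Hypothetical hash family: SHA-256 salted with the hash index t.
        digest = hashlib.sha256(f"{t}:{item}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % c)
    return mask


def graph_dna(neighbors, c, k, depth, theta=1.0):
    """Encode every node's neighborhood as a length-c 0/1 row.

    neighbors[i] is an iterable of node ids adjacent to node i.
    theta is an assumed saturation cap: a filter stops absorbing
    neighbors once at least theta * c of its bits are set.
    """
    n = len(neighbors)
    filters = [bloom_mask(i, c, k) for i in range(n)]  # each node starts with itself
    for _ in range(depth):
        updated = list(filters)
        for i in range(n):
            for j in neighbors[i]:
                if bin(updated[i]).count("1") >= theta * c:
                    break  # filter is saturated, skip remaining neighbors
                updated[i] |= filters[j]  # union of two Bloom filters is a bitwise OR
        filters = updated
    # Stack the bit-arrays into an n x c 0/1 matrix (list of rows).
    return [[(f >> b) & 1 for b in range(c)] for f in filters]


def maybe_contains(row, item, k):
    """Bloom-filter membership test: no false negatives, possible false positives."""
    c = len(row)
    mask = bloom_mask(item, c, k)
    return all(row[b] for b in range(c) if (mask >> b) & 1)
```

In this sketch no training is involved: the whole encoding consists of hashing and bitwise OR, and a node reachable within `depth` rounds of merging is always reported as a member of the filter.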

Suppose we are given the sparse rating matrix $R \in \mathbb{R}^{n \times m}$ with $n$ users and $m$ items, and a graph $G \in \mathbb{R}^{n \times n}$ encoding relationships between users. For simplicity, we do not assume a graph on the $m$ items, though including it should be straightforward.

Explicit Feedback : The objective function of Graph Regularized Matrix Factorization (GRMF) [13, 89, 128] is:

where $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{m \times r}$ are the embeddings associated with users and items respectively, $n$ is the number of users and $m$ is the number of items, $R \in \mathbb{R}^{n \times m}$ is the sparse rating matrix, $\mathrm{tr}(\cdot)$ is the trace operator, $\lambda, \mu$ are tuning coefficients, and $\mathrm{Lap}(\cdot)$ is the graph Laplacian operator.
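A standard way to write a graph-regularized objective with exactly these ingredients is sketched below. The index set $\Omega$ of observed ratings and the rows $u_i$, $v_j$ of $U$, $V$ are notation assumed here, and the constant factors in front of each term are likewise an assumption rather than a transcription of (2.1):

```latex
\min_{U, V} \;
  \sum_{(i,j) \in \Omega} \left( R_{ij} - u_i^\top v_j \right)^2
  + \lambda \left( \|U\|_F^2 + \|V\|_F^2 \right)
  + \mu \, \mathrm{tr}\!\left( U^\top \mathrm{Lap}(G)\, U \right)
```

For a symmetric $G$ and the unnormalized Laplacian $\mathrm{Lap}(G) = D - G$, where $D$ is the diagonal degree matrix, the trace term equals $\frac{1}{2} \sum_{i,j} G_{ij} \|u_i - u_j\|_2^2$, which makes explicit why strongly connected users are pushed toward similar embeddings.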

The last term is called graph regularization, which tries to enforce similar nodes (measured by edge weights in $G$) to have similar embeddings. One naive way [14] to extend this to higher-order graph regularization is to replace the graph $G$ with $\sum_{i=1}^{K} w_i \cdot G^i$ and then use the graph Laplacian of $\sum_{i=1}^{K} w_i \cdot G^i$ to replace that of $G$ in (2.1). Computing $G^i$ even for small $i$ is computationally infeasible for most real-world applications, and we quickly lose the sparsity of the graph, leading to memory issues. Sampling or thresholding could mitigate the problem but suffers from performance degradation.

In contrast, our graph DNA obtained from Algorithm 1 does not suffer from any of these issues. The space complexity of our method is only of order $O(n \log n)$ for a graph with $n$ nodes, instead of $O(n^2)$. The reduced number of non-zero elements using graph DNA leads to a significant speed-up in many cases.

We can easily use graph DNA in GRMF as follows: we treat the $c$ bits as $c$ new pseudo-nodes and add them to the original graph $G$. We then have $n + c$ nodes in a modified graph $\dot{G}$:

To account for the $c$ new nodes, we expand $U \in \mathbb{R}^{n \times r}$ to $\dot{U} \in \mathbb{R}^{(n+c) \times r}$ by appending parameters for the meta-nodes. The objective function for GRMF with Graph DNA will be the same as (2.1), except that $U$ and $G$ are replaced with $\dot{U}$ and $\dot{G}$. At the prediction stage, we discard the meta-node embeddings.
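One way to realize the modified graph is as a block adjacency matrix in which the original $n$ users keep their edges from $G$ and each user is additionally linked to the pseudo-nodes for the bits set in its Bloom filter row. The block layout $[[G, B], [B^\top, 0]]$ used below is an assumption consistent with treating $B$ as a bipartite adjacency matrix between users and pseudo-nodes; `augment_graph` and `laplacian` are hypothetical helper names, and dense nested lists are used only to keep the sketch dependency-free.

```python
def augment_graph(G, B):
    """Build the (n + c) x (n + c) adjacency of the augmented graph.

    G: n x n adjacency (list of lists); B: n x c 0/1 Bloom filter matrix.
    Assumed block layout: [[G, B], [B^T, 0]].
    """
    n, c = len(G), len(B[0])
    top = [list(G[i]) + list(B[i]) for i in range(n)]                   # [G | B]
    bottom = [[B[i][b] for i in range(n)] + [0] * c for b in range(c)]  # [B^T | 0]
    return top + bottom


def laplacian(A):
    """Unnormalized graph Laplacian D - A of an adjacency matrix A."""
    size = len(A)
    return [
        [(sum(A[i]) if i == j else 0) - A[i][j] for j in range(size)]
        for i in range(size)
    ]
```

The Laplacian of the augmented matrix can then be plugged into the graph regularization term in place of the Laplacian of $G$, with $U$ extended by $c$ extra rows that are dropped at prediction time.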

Implicit Feedback : For implicit feedback data, when R is a 0/1 matrix, weighted matrix factorization is a widely used algorithm [48, 49] . The only difference is that the

where ρ < 1 is a hyper-parameter reflecting the confidence of zero entries. In this case, we can apply the Graph DNA encoding as before trivially.

Co-Factorization of Rating and Graph Information (Co-Factor) [67, 103] is conceptually very different from GRMF and GRWMF, because it does not use graph information as a regularization term. Instead, it treats the graph adjacency matrix as another rating matrix, sharing one-sided latent factors with the original rating matrix. Co-Factor minimizes the following objective function:

We can extend Co-Factor to incorporate our DNA-$d$ by replacing $G$ with $B$ in the equation above, where $B \in \mathbb{R}^{n \times c}$ is the Bloom filter bipartite graph adjacency matrix of $n$ real-user nodes and $c$ pseudo-user nodes, similar to $B$ as in (2.2). We call the extension Co-Factor DNA-$d$.

Graph Convolutional Matrix Completion (GC-MC) is a graph convolutional network (GCN) based geometric matrix completion method [8]. In [8], the rating matrix $R$ is treated as an adjacency matrix in the GCN, while the side information $G$ is treated as a feature matrix for the nodes: each user has an $n$-dimensional 0/1 feature that corresponds to a column of $G$. The GCN model then performs convolutions of these features on the bipartite rating graph.

We find in our experiments that using these one-hot encodings of the graph as features is an inferior choice in terms of both performance and speed. To capture higher-order side graph information, it is better to use $G + \alpha G^2$ for some constant $\alpha$, and this alternate choice usually gives a smaller generalization error than the original GC-MC method. However, it is hard to explicitly calculate $G + \alpha G^2$ and store the entire matrix for a large graph, for the same reason described in Section 2.4.1. Again, we can use graph DNA to efficiently encode and store the higher-order information before feeding it into GC-MC. We show in our experiments that this outperforms current state-of-the-art GCN methods [8, 74] as well as GC-MC with graph encoding methods that require training, such as Node2vec [32] and DeepWalk [84]. Our encoding scheme does not require training and is therefore much faster than previous encoding methods. More details are discussed in the experiment section 2.5.3.

We show that our Graph DNA encoding technique can improve the performance of 4 popular graph-based recommendation algorithms: graph-regularized matrix factorization with explicit and implicit feedback, co-factorization, and GCN-based recommendation.

We first simulate a user/item rating dataset with user graph as side information, generate its graph DNA, and use it on a downstream task: matrix factorization.

We randomly generate user and item embeddings from standard Gaussian distributions, and construct an Erdős-Rényi random graph of users. User embeddings are generated using Algorithm 11 in the Appendix: at each propagation step, each user's embedding is updated by an average of its current embedding and its neighbors' embeddings. Based on the user and item embeddings after $T = 3$ iterations of propagation, we generate the underlying rating for each user-item pair according to the inner product of their embeddings, and then sample a small portion of the dense rating matrix as training and test sets.
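The simulation procedure above can be sketched as follows. Algorithm 11 itself is not reproduced in this chapter, so the update rule here is one plausible reading (the new embedding is the midpoint between a user's current embedding and the mean of its neighbors' embeddings), and the function names, edge probability `p`, and sampling scheme are all illustrative assumptions.

```python
import random


def erdos_renyi(n, p, rng):
    """Sample an undirected Erdős-Rényi graph as a list of neighbor sets."""
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def propagate(emb, neighbors, steps):
    """Smooth embeddings over the graph for a fixed number of steps."""
    for _ in range(steps):
        updated = []
        for i, vec in enumerate(emb):
            nbrs = neighbors[i]
            if not nbrs:
                updated.append(list(vec))  # isolated users keep their embedding
                continue
            nbr_mean = [sum(emb[j][d] for j in nbrs) / len(nbrs) for d in range(len(vec))]
            # Assumed rule: average of the current embedding and the neighbor mean.
            updated.append([(a + b) / 2.0 for a, b in zip(vec, nbr_mean)])
        emb = updated
    return emb


def simulate(n_users, n_items, rank, p, steps, n_observed, seed=0):
    """Return the user graph, embeddings, and a sampled set of ratings."""
    rng = random.Random(seed)
    graph = erdos_renyi(n_users, p, rng)
    users = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(n_users)]
    items = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(n_items)]
    users = propagate(users, graph, steps)
    all_cells = [(i, j) for i in range(n_users) for j in range(n_items)]
    cells = rng.sample(all_cells, n_observed)
    # Ratings are inner products of the propagated user and item embeddings.
    observed = {(i, j): sum(a * b for a, b in zip(users[i], items[j])) for i, j in cells}
    return graph, users, items, observed
```

Because the user embeddings are smoothed for `steps` rounds, ratings in such a dataset depend on graph neighborhoods up to that depth, which is what makes the synthetic setting a useful check for deeper encodings.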

We implement our graph DNA encoding algorithm in Python using a scalable Python library [3] to generate the Bloom filter matrix $B$. We adapt the GRMF C++ code to solve the objective function of GRMF DNA-$K$ with our Bloom filter enhanced graph $\dot{G}$. We compare the following variants:

1. MF: classical matrix factorization with only $\ell_2$ regularization and without graph information.

2. GRMF $G^d$: GRMF with $\ell_2$ regularization and using $G, G^2, \dots, G^d$ [14].

3. GRMF DNA-$d$: GRMF with $\ell_2$ regularization but using our proposed graph DNA-$d$.

We report the prediction performance with Root Mean Squared Error (RMSE) on test data. All results are reported on the test set, with all relevant hyperparameters tuned on a held-out validation set. To accurately measure how large the relative gain is from using deeper information, we introduce a new metric called Relative Graph Gain (RGG) for using information X, which is defined as:

where RMSE is measured for the same method with different graph information. This metric would be 0 if only first order graph information is utilized and is only defined when the denominator is positive.
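A small helper makes the metric concrete. The exact formula is not reproduced in the text here, so the form below is an assumption chosen to satisfy the two stated properties: it evaluates to 0 when $X$ is the first-order graph $G$, and it is defined only when plain MF has a larger RMSE than GRMF with $G$. The function and argument names are hypothetical.

```python
def relative_graph_gain(rmse_mf, rmse_g, rmse_x):
    """Relative Graph Gain (in percent) of graph information X over first-order G.

    Assumed form: ((RMSE_MF - RMSE_X) / (RMSE_MF - RMSE_G) - 1) * 100.
    rmse_mf: RMSE of MF without graph information.
    rmse_g:  RMSE of the same method using only the first-order graph G.
    rmse_x:  RMSE of the same method using graph information X.
    """
    denom = rmse_mf - rmse_g
    if denom <= 0:
        raise ValueError("RGG is only defined when the first-order graph improves over MF")
    return ((rmse_mf - rmse_x) / denom - 1.0) * 100.0
```

Under this reading, a value of 100 means that $X$ doubles the RMSE improvement that the first-order graph achieves over plain matrix factorization.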

In Table 2.1, we can easily see that using a deeper neighborhood helps the recommendation performance on this synthetic dataset. Graph DNA-3's gain is 166% larger than that of using the first-order graph $G$. We can see an increase in performance gain for an increase in depth $d$ when $d \leq 3$. This is expected because we set $T = 3$ during our creation of this dataset.

Next, we show that graph DNA can improve the performance of GRMF for explicit feedback. We conduct experiments on two real datasets: Douban [70] and Flixster [127] .

Both datasets contain explicit feedback with ratings from 1 to 5. We pre-processed Douban and Flixster following the same procedure as in [89, 115]. The experimental setups and comparisons are almost identical to the synthetic data experiment (see details in section 2.5.1). Due to the exponentially growing number of non-zero elements in the graph as we go deeper (see Table 8.2), we are unable to run the full GRMF $G^4$ and GRMF $G^5$ for these datasets. In fact, GRMF $G^3$ itself is too slow, so we thresholded $G^3$ by only considering entries whose values are equal to or larger than 4. For the Bloom filter, we set a false positive rate of 0.1 and use a capacity of 500, resulting in $c = 4{,}796$.

We can see from Table 2.1 that deeper graph information always helps. For Douban, graph DNA-3 is most effective, giving a relative graph gain of 82.79% compared to only a 2% gain when using $G^2$ or $G^3$ naively. Interestingly, for Flixster, using $G^2$ is better than using $G^3$. However, Graph DNA-3 and DNA-4 yield 10x and 15x performance improvements respectively, lending credence to the implicit regularization property of graph DNA. For a fixed-size Bloom filter, the computational complexity of graph DNA scales linearly with depth $d$, as compared to exponentially for GRMF $G^d$. We measure the speed in

We show that our graph DNA can improve Co-Factor [67, 103] as well. The results are in Table 2.1. We find that applying DNA-3 to the Co-Factor method improves performance on both datasets, more so for Flixster. This is consistent with our observations for GRMF in Table 2.1: deep graph information is more helpful for Flixster than for Douban.

Applying Graph DNA to Co-Factor is detailed in the Appendix. All the methods except GC-MC utilize side graph information.

We follow the same procedure as in [116] to set ratings of 4 and above to 1, and the rest to 0. We compare the baseline graph-based weighted matrix factorization [48, 49] with our proposed weighted matrix factorization with DNA-3. We do not compare with Bayesian personalized ranking [91] and the recently proposed SQL-Rank [116], as they cannot easily utilize graph information.

The results are summarized in Table 2.3 with experimental details in the Appendix.

Again, using DNA-3 achieves better prediction results over the baseline in terms of every single metric on both Douban and Flixster datasets.

We can use graph DNA instead to efficiently encode and store the higher order information before feeding it into GC-MC.

We use the same split of three real-world datasets and follow the exact procedures as in [8, 74]. We tuned hyperparameters using a validation dataset and report the best test results found within 200 epochs using the optimal parameters. We repeated the experiments 6 times and report the mean and standard deviation of the test RMSE. After some tuning, we use Bloom filters with a capacity of 10 for Douban and 60 for Flixster, as the latter has a much denser second-order graph. With a false positive rate of 0.1, this implies that we use 96-bit Bloom filters for Douban and 960-bit Bloom filters for Flixster. We use the resulting Bloom filter bit-arrays as the node features and pass them as the input to GC-MC. Using Graph DNA-2, the input feature dimensions are thus reduced from 3000 to 96 and 960, which leads to a significant speed-up. The original GC-MC method did not scale well beyond 3000 by 3000 rating matrices with user and item side graphs, as it requires using the normalized adjacency matrix as user/item features. PinSage [123], while scalable, does not utilize the user/item side graphs. Furthermore, it is not feasible to have $O(n)$-dimensional features for the nodes, where $n$ is the number of nodes in the side graphs. In contrast, our method only requires $O(\log n)$-dimensional features. We can see from Table 2.4 that we outperform both GCN-based methods [8] and [74] by a large margin.

Note that another potential way to improve over GC-MC is to use other graph encoding schemes like Node2Vec [32] and DeepWalk [84] to encode the user-user graph into node features. One clear drawback is that those graph embedding methods are time-consuming.

Using the official Node2vec implementation, excluding reading and writing, it takes 416.13 seconds to encode the 3K by 3K subsampled Yahoo-Music item graph and obtain the resulting 760-d node embeddings. For our method, it only takes 7.55 seconds to obtain 760-d features for the same graph. Similarly, it takes over 15 minutes to run the official C++ code for DeepWalk [84] using the same parameters as Node2Vec to encode the graph. In fact, fast encoding via hashing and bitwise-or that does not require training is one of the main advantages of our method.

Furthermore, even without considering the time overhead, we found that our graph DNA encoding outperforms Node2Vec and DeepWalk in terms of test RMSE. Details can be found in Table 2.4. This could be because encoding higher-order information is more important for graph-regularized recommendation tasks, and graph DNA is a better and more direct way to encode higher-order information than Node2Vec and DeepWalk.

Speed Comparisons. Next, we compare the speed-ups obtained by graph DNA-$d$ with GRMF $G^d$ (a naive way to encode higher-order information by computing powers of $G$). Figure 3 suggests that graph DNA-1 (which encodes hop-2 information) scales better than directly computing $G^2$ in GRMF.

Exploring Effects of Rank. Finally, we investigate whether the proposed DNA coding can achieve consistent improvements when varying the rank in the GRMF algorithm. In Table 2.5, we compare the proposed GRMF DNA-3 with GRMF $G^2$, which achieves the best RMSE without using DNA coding in the previous tables. The results clearly show that the improvement of the proposed DNA coding is consistent over different ranks and works even better when the rank is larger.

In this chapter, we proposed Graph DNA, a deep neighborhood aware encoding scheme for collaborative filtering with graph information. We make use of Bloom filters to incorporate higher-order graph information, without the need to explicitly minimize a loss function. The resulting encoding is extremely space and computationally efficient, and lends itself well to multiple algorithms that make use of graph information, including Graph Convolutional Networks. Experiments show that Graph DNA encoding outperforms several baseline methods on multiple datasets in both speed and performance.

Large-scale Pairwise Collaborative Ranking in Near-Linear Time

In online retail and online content delivery applications, it is commonplace to have embedded recommendation systems: algorithms that recommend items to users based on previous user behaviors and ratings. Online retail companies develop sophisticated recommendation systems based on purchase behavior, item context, and shifting trends. The Netflix prize [7], in which competitors utilize user ratings to recommend movies, accelerated research in recommendation systems. While the winning submissions agglomerated several existing methods, one essential methodology, latent factor models, emerged as a critical component. The latent factor model means that the approximated rating for user $i$ and item $j$ is given by

One interpretation is that there are k latent topics and the approximated rating can be reconstructed as a combination of factor weights. By minimizing the square error loss of this reconstruction we arrive at the incomplete SVD,

where Ω contains sampled indices of the rating matrix, R.
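In standard notation, the latent factor approximation and the incomplete SVD objective described above can be written as follows. This is a sketch of the usual form: the factor vectors $u_i, v_j \in \mathbb{R}^k$ and the factor matrices $U$, $V$ that collect them are notation assumed here, and no regularization term is included.

```latex
R_{ij} \approx u_i^\top v_j,
\qquad
\min_{U, V} \; \sum_{(i,j) \in \Omega} \left( R_{ij} - u_i^\top v_j \right)^2
```

Each coordinate of $u_i$ can be read as the weight user $i$ places on one of the $k$ latent topics, and the matching coordinate of $v_j$ as the degree to which item $j$ expresses that topic.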

Often the performance of recommendation systems is not measured by the quality of rating prediction, but rather the ranking of the items that the system returns for a given user. The task of finding a ranking based on ratings or relative rankings is called Collaborative Ranking. Recommendation systems can be trained with ratings, that may be passively or actively collected, or by relative rankings, in which a user is asked to rank a number of items. A simple way to unify the framework is to convert the ratings into rankings by making pairwise comparisons of ratings. Specifically, the algorithm takes as input the pairwise comparisons, Y i,j,k for each user i and item pairs j, k.
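The conversion from ratings to pairwise comparisons can be illustrated with a short Python sketch. The encoding used here (+1 when item $j$ is rated above item $k$, -1 when it is rated below, and no comparison for ties) is an assumed convention for $Y_{i,j,k}$, and `ratings_to_pairs` is a hypothetical helper name.

```python
from itertools import combinations


def ratings_to_pairs(user_ratings):
    """Convert one user's ratings into pairwise comparisons.

    user_ratings: dict mapping item id -> rating for a single user i.
    Returns a dict mapping (j, k) -> +1 if item j is preferred to item k,
    -1 if item k is preferred; tied pairs are skipped.
    """
    pairs = {}
    for (j, rating_j), (k, rating_k) in combinations(sorted(user_ratings.items()), 2):
        if rating_j == rating_k:
            continue  # a tie carries no preference information
        pairs[(j, k)] = 1 if rating_j > rating_k else -1
    return pairs
```

A user who rated $d$ items contributes up to $d(d-1)/2$ comparisons, so the number of pairs can be far larger than the number of ratings.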

This approach confers several advantages. Users may have different standards for their ratings; some users are more generous with their ratings than others. This is known as the calibration drawback, and to deal with it we must make a departure from standard matrix factorization methods. Because we focus on ranking and not on predicting ratings, we can expect improved performance when recommending the top items. Our goal in this chapter is to provide a collaborative ranking algorithm that can scale to the size of the full Netflix dataset, a heretofore open problem.

The existing collaborative ranking algorithms (for a summary, see section 3.2) are limited by the number of observed ratings per user in the training data and cannot scale to massive datasets, which makes the recommendation results less accurate and less useful in practice. This motivates our algorithm, which can make use of the entire Netflix dataset without sub-sampling. Our contributions are summarized below:

• For input data in the form of pairwise preference comparisons, we propose a new algorithm, Primal-CR, that alternately minimizes over the latent factors using Newton's method in the primal space. By carefully designing the computation of the gradient and Hessian-vector product, our algorithm reduces the time complexity per iteration to $O(|\Omega| + d_1 \bar{d}_2 r)$, while the state-of-the-art approach [81] has $O(|\Omega| r)$ complexity. Here $|\Omega|$ (the total number of pairs) is much larger than $d_1 \bar{d}_2$ ($d_1$ is the number of users and $\bar{d}_2$ is the average number of items rated by a user). For the Netflix problem, $|\Omega| = 2 \times 10^{10}$ while $d_1 \bar{d}_2 = 10^{8}$.

• For input data in the form of ratings, we can further exploit the structure to speed up the gradient and Hessian computation. The resulting algorithm, Primal-CR++, further reduces the time complexity to $O(d_1 \bar{d}_2 (r + \log \bar{d}_2))$ per iteration. In this setting, our algorithm has time complexity near-linear in the input size and has speed comparable to the classical matrix factorization model, which takes $O(d_1 \bar{d}_2 r)$ time, while achieving much better recommendations by minimizing the ranking loss.

We show that our algorithms outperform existing algorithms on real world datasets and can be easily parallelized.

Collaborative filtering methodologies are summarized in [95] (see [24] for an early work).

Among them, matrix factorization [61] has been widely used due to the success in the Netflix Prize. Many algorithms have been developed based on matrix factorization [19, 48, 90, 91, 102] , and many scalable algorithms have been developed [29, 61] . However, they are not suitable for ranking top items for a user due to the fact that their goal is to minimize the mean-square error (MSE) instead of ranking loss. In fact, MSE is not a good metric for recommendation when we want to recommend the top K items to a user. This has been pointed out in several papers [5] which argue normalized discounted cumulative gain (NDCG) should be used instead of MSE, and our experimental results also confirm this finding by showing that minimizing the ranking loss results in better precision and NDCG compared with the traditional matrix factorization approach that is targeting squared error.

Ranking is a well-studied problem, and there has been a long line of research focusing on learning a single ranking function, which is called Learning to Rank. For example, RankSVM [53] is a well-known pair-wise model, and an efficient solver for RankSVM has been proposed in [17]. ListNet [15] is a list-wise model implemented using neural networks.

Another class of models, the point-wise models, fits the ratings explicitly but suffers from the calibration drawback (see [34]).

The collaborative ranking (CR) problem is essentially trying to learn multiple rankings together, and several models and algorithms have been proposed in the literature. The Cofirank algorithm [114], which tailors maximum margin matrix factorization [105] for collaborative ranking, is a point-wise model for CR and is regarded as the performance benchmark for this task. If the ratings are 1-bit, a weighting scheme has been proposed to improve the usual point-wise matrix factorization approach [78]. List-wise models for Learning to Rank can also be extended to the many-rankings setting [100]. However, this extension is still quite similar to a point-wise approach, since it only considers the top-1 probabilities.

For pairwise models in collaborative ranking, it is well known that they do not encounter the calibration drawback as point-wise models do, but they are computationally intensive and cannot scale well to large data sets [100]. The scalability problem for pairwise models is mainly due to the fact that their time complexity is at least proportional to $|\Omega|$, the number of pairwise preference comparisons, which grows quadratically with the number of rated items for each user. Recently, [81] proposed a new algorithm, Collrank, and showed that it has better precision and NDCG and is much faster than other CR methods on real-world datasets, including Bayesian Personalized Ranking (BPR) [91]. Unfortunately, its scalability is still constrained by the number of pairs, so it can only run on subsamples of large datasets such as Netflix. Our algorithms in this chapter, Primal-CR and Primal-CR++, also belong to the family of pairwise models, but by carefully rearranging the computation we obtain much better time complexity than existing methods, and as a result our algorithms can scale to very large datasets.

Many other algorithms have been proposed for the many-rankings setting, but none of those mentioned below can scale up to the extent of using all the ratings in the full Netflix data. A few use Bayesian frameworks to model the problem [91], [79], [111], the last of which requires many specified parameters. Another proposes retargeted matrix factorization, which obtains rankings by monotonically transforming the ratings [62]. [33] proposes a similar model without making generative assumptions on the ratings beyond assuming low rank and correctness of the ranking order.

We first formally define the collaborative ranking problem using the example of an item recommender system. Assume we have $d_1$ users and $d_2$ items. The input data is given in the form of "for user $i$, item $j$ is preferred over item $k$" and thus can be represented by a set of tuples $(i, j, k)$. We use $\Omega$ to denote the set of observed tuples, and the observed pairwise preferences are denoted as $\{Y_{ijk} \mid (i, j, k) \in \Omega\}$, where $Y_{ijk} = 1$ denotes that item $j$ is preferred over item $k$ by user $i$, and $Y_{ijk} = -1$ denotes that item $k$ is preferred over item $j$ by user $i$.

The goal of collaborative ranking is to rank all the unseen items for each user $i$ based on these partial observations, which can be done by fitting a scoring matrix $X \in \mathbb{R}^{d_1 \times d_2}$. If the scoring matrix has $X_{ij} > X_{ik}$, it implies that item $j$ is preferred over item $k$ by user $i$, and therefore we should rank item $j$ higher than item $k$. After we estimate the scoring matrix $X$ by solving the optimization problem described below, we can then recommend the top $k$ items for any particular user.
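As a small illustration of how a fitted scoring matrix is used, the sketch below ranks the unseen items of one user by their scores and returns the top $k$. The function name `recommend_top_k` and the `seen` argument are hypothetical and only serve this example; it assumes $X$ is available as a dense NumPy array.

```python
import numpy as np

def recommend_top_k(X, user, k, seen=()):
    """Return the indices of the k highest-scoring unseen items for one user.

    X    : (d1, d2) scoring matrix, X[i, j] = score of item j for user i
    seen : items already rated by the user (hypothetical argument), excluded
    """
    scores = np.array(X[user], dtype=float)          # copy the user's row
    scores[np.asarray(list(seen), dtype=int)] = -np.inf
    # A larger score means the item is preferred, so sort in descending order.
    return np.argsort(-scores)[:k].tolist()
```

For example, with scores `[0.1, 0.9, 0.5, 0.7]` for a user, the top two items are items 1 and 3; if item 1 has already been rated, the recommendation becomes items 3 and 2.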

The Collaborative Ranking Model referred to in this chapter is the one proposed recently in [81]. It belongs to the family of pairwise models for collaborative ranking because it uses pairwise training losses [5]. The model is given as

$$\min_{X} \sum_{(i,j,k) \in \Omega} L\big(Y_{ijk}(X_{ij} - X_{ik})\big) + \lambda \|X\|_*, \tag{3.2}$$

where $L(\cdot)$ is the loss function, $\|X\|_*$ is the nuclear norm of $X$, defined as the sum of all the singular values of $X$, and $\lambda$ is a regularization parameter. The ranking loss in the first term of (3.2) penalizes a pair when $Y_{ijk} = 1$ but $X_{ij} - X_{ik}$ is positive but small, and penalizes it even more when the difference is negative. The second term is based on the assumption that only a small number of latent factors contribute to the users' preferences, which is analogous to the idea behind incomplete SVD for matrix factorization mentioned in the introduction. In general we can use any loss function, but since the $L_2$-hinge loss

$$L(a) = \max(0, 1 - a)^2 \tag{3.3}$$

gives the best performance in practice [81] and enjoys many nice properties, such as smoothness and differentiability, we will focus on the $L_2$-hinge loss in this chapter. In fact, our first algorithm Primal-CR can be applied to any loss function, while Primal-CR++ can only be applied to the $L_2$-hinge loss.

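Although the objective function in (3.2) is convex, solving it directly is not feasible for large-scale problems: $d_1$ and $d_2$ can be so large that the scoring matrix $X$ cannot even be stored in memory. Therefore, in practice people usually transform (3.2) into a non-convex form by substituting $X = U^T V$ with $U \in \mathbb{R}^{r \times d_1}$ and $V \in \mathbb{R}^{r \times d_2}$, and in that case, since $\|X\|_* = \min_{X = U^T V} \frac{1}{2}\big(\|U\|_F^2 + \|V\|_F^2\big)$, problem (3.2) can be reformulated as

$$\min_{U, V} \sum_{(i,j,k) \in \Omega} L\big(Y_{ijk}\, u_i^T (v_j - v_k)\big) + \frac{\lambda}{2}\big(\|U\|_F^2 + \|V\|_F^2\big). \tag{3.4}$$

We use $u_i$ and $v_j$ to denote the columns of $U$ and $V$, respectively. Note that [81] also solves the non-convex form (3.4) in their experiments, and in the rest of this chapter we propose a faster algorithm for solving (3.4).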
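To make the factored objective concrete, the following sketch evaluates a pairwise ranking loss plus Frobenius-norm regularization for given factors. It assumes the squared hinge $L(a) = \max(0, 1-a)^2$ and the regularizer $\frac{\lambda}{2}(\|U\|_F^2 + \|V\|_F^2)$; the function names and the array layout (`triples` as an integer array of rows $(i, j, k)$, `y` as a vector of $\pm 1$) are assumptions made for this illustration only.

```python
import numpy as np

def l2_hinge(a):
    """Squared hinge loss L(a) = max(0, 1 - a)^2, applied elementwise."""
    return np.maximum(0.0, 1.0 - a) ** 2

def objective(U, V, triples, y, lam):
    """Pairwise ranking loss plus Frobenius regularisation.

    U : (r, d1) user factors, V : (r, d2) item factors (columns u_i, v_j)
    triples : (n, 3) int array of (i, j, k); y : (n,) array of +1 / -1
    """
    i, j, k = triples[:, 0], triples[:, 1], triples[:, 2]
    # margin_n = y_n * u_i^T (v_j - v_k), computed for all triples at once
    margins = y * np.einsum("rn,rn->n", U[:, i], V[:, j] - V[:, k])
    reg = 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))
    return float(l2_hinge(margins).sum() + reg)
```

Note that this direct evaluation touches every triple with an $O(r)$ inner product, so it costs $O(|\Omega| r)$; it is meant as a reference for checking faster implementations, not as an efficient routine.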

Although collaborative ranking assumes that the input data is given in the form of pairwise comparisons, in reality almost all datasets (Netflix, Yahoo-Music, MovieLens, etc.) contain user ratings of items in the form $\{R_{ij} \mid (i, j) \in \bar{\Omega}\}$, where $\bar{\Omega}$ is the set of observed user-item pairs. Therefore, in practice we have to transform the rating-based data into pairwise comparisons by generating all the item pairs rated by the same user:

$$\Omega = \{(i, j, k) \mid R_{ij} > R_{ik},\ j, k \in \bar{\Omega}_i\}, \tag{3.5}$$

where $\bar{\Omega}_i := \{j \mid (i, j) \in \bar{\Omega}\}$ is the set of items rated by user $i$. Assume there are on average $\bar{d}_2$ items rated by a user (i.e., $\bar{d}_2 = \mathrm{mean}(|\bar{\Omega}_i|)$); then the collaborative ranking problem has $O(d_1 \bar{d}_2^2)$ pairs, so the size of $\Omega$ grows quadratically in the number of items rated per user.
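The conversion from ratings to comparisons can be sketched as follows. The nested-dictionary input format and the name `ratings_to_triples` are assumptions for this example; each output triple $(i, j, k)$ means that user $i$ rated item $j$ strictly higher than item $k$, and tied ratings produce no comparison.

```python
from itertools import combinations

def ratings_to_triples(ratings):
    """Convert ratings into pairwise comparisons.

    ratings : dict mapping user -> dict mapping item -> rating
    returns : list of (i, j, k) with R_ij > R_ik
    """
    triples = []
    for i, items in ratings.items():
        for j, k in combinations(items, 2):   # every unordered pair of rated items
            if items[j] > items[k]:
                triples.append((i, j, k))
            elif items[j] < items[k]:
                triples.append((i, k, j))
            # equal ratings: no comparison is generated
    return triples
```

A user with $d$ distinctly rated items contributes $d(d-1)/2$ triples, which is the quadratic growth discussed above.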

Unfortunately, all the existing algorithms have $O(|\Omega| r)$ complexity, so they cannot scale to a large number of items. For example, the AltSVM (also referred to as Collrank) algorithm in [81] runs out of memory when we subsample 500 rated items per user on the Netflix dataset, since its implementation stores all the pairs in memory and therefore requires $O(|\Omega|)$ memory. It therefore cannot be used for the full Netflix dataset, which has more than 20 billion pairs and would require 300GB of memory. To the best of our knowledge, no collaborative ranking algorithm has been applied to the full Netflix data set. In real applications, however, we hope to make use of as much information as possible to make better recommendations. As shown in our experiments later, using the full training data instead of sub-sampling (such as selecting a fixed number of rated items per user) achieves higher prediction and recommendation accuracy on the same test data.

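To overcome this scalability issue, we propose two novel algorithms for solving problem (3.4). When the input is given as pairwise comparisons, our first algorithm, Primal-CR, reduces the per-iteration time complexity from $O(|\Omega| r)$ to $O(|\Omega| + d_1 \bar{d}_2 r)$.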

If the input file is given as ratings, we can further reduce the time complexity to $O(d_1 \bar{d}_2 r + d_1 \bar{d}_2 \log \bar{d}_2)$ using exactly the same optimization algorithm but smarter ways to compute the gradient and the Hessian-vector product. This time complexity is much smaller than the number of comparisons $|\Omega| = O(d_1 \bar{d}_2^2)$, and we call this algorithm Primal-CR++. We first introduce Primal-CR in Section 3.4.2, and then present Primal-CR++ in Section 3.4.3.

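In the first setting, we consider the case where the pairwise comparisons $\{Y_{ijk} \mid (i, j, k) \in \Omega\}$ are given as input. To solve problem (3.4), we alternately minimize over $U$ and $V$ in the primal space (see Algorithm 2). First, we fix $U$ and update $V$; the subproblem for $V$ with $U$ fixed can be written as follows:

$$\min_{V \in \mathbb{R}^{r \times d_2}} f(V) := \frac{\lambda}{2}\|V\|_F^2 + \sum_{(i,j,k) \in \Omega} L\big(Y_{ijk}\, u_i^T (v_j - v_k)\big). \tag{3.6}$$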

In [81], this subproblem is solved by stochastic dual coordinate descent, which requires $O(|\Omega| r)$ time and $O(|\Omega|)$ space. Furthermore, a decrease of the objective function for the dual problem sometimes does not imply a decrease of the primal objective function value, which often results in slow convergence. We therefore propose to solve this subproblem for $V$ using the primal truncated Newton method (Algorithm 3).

Algorithm 2. Alternating minimization in the primal space (excerpt)
    procedure Fix U and update V
        while not converged do
            Apply truncated Newton update (Algorithm 3)
    procedure Fix V and update U
        while not converged do
            Apply truncated Newton update (Algorithm 3)
    return U, V    (recover score matrix X)

Newton's method is a classical second-order optimization algorithm. For minimizing a twice-differentiable scalar function $f(x)$ of a vector $x$, Newton's method iteratively updates the solution by $x \leftarrow x - (\nabla^2 f(x))^{-1} \nabla f(x)$. However, the matrix inverse is usually expensive to compute, so a truncated Newton method instead computes the update direction $s$ by solving the linear system $\nabla^2 f(x)\, s = \nabla f(x)$ up to a certain accuracy, usually with a linear conjugate gradient method. If we vectorize the problem for updating $V$ in (3.6), the gradient is an $(r d_2)$-sized vector and the Hessian is an $(r d_2)$-by-$(r d_2)$ matrix, so explicitly forming the Hessian is infeasible. Below we discuss how to apply the truncated Newton method to our problem and how to compute each part efficiently.
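The inner solve of a truncated Newton step can be sketched with a plain conjugate gradient loop that touches the Hessian only through Hessian-vector products. This is a generic illustration, not the dissertation's implementation: the function name, the `hvp` callback and the `max_iter` cap are assumptions, and the relative stopping tolerance of $10^{-2}$ is an assumed default.

```python
import numpy as np

def truncated_newton_direction(hvp, g, max_iter=50, tol=1e-2):
    """Approximately solve H * delta = g by conjugate gradient.

    hvp : callable returning H @ p for a vector p (H is never formed)
    g   : gradient vector
    Stops once the residual norm falls below tol * ||initial residual||.
    """
    delta = np.zeros_like(g, dtype=float)
    r = np.array(g, dtype=float)       # residual g - H*delta at delta = 0
    r0_norm = np.linalg.norm(r)
    if r0_norm == 0.0:                 # already stationary: nothing to solve
        return delta
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        q = hvp(p)                     # the only access to the Hessian
        alpha = rs / (p @ q)
        delta += alpha * p
        r -= alpha * q
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * r0_norm:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return delta
```

The Newton update is then `x = x - truncated_newton_direction(hvp, grad)`, so the per-iteration cost is dominated by the cost of one Hessian-vector product per CG step.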

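Algorithm 3. Truncated Newton update for V (excerpt)
    Compute the Hessian-vector product $q = H p_k$
    $\delta_{k+1} = \delta_k + \alpha_k p_k$
    if $\|r_{k+1}\|_2 < \|r_0\|_2 \cdot 10^{-2}$ then
        break

When applying the truncated Newton method, the gradient $\nabla f(V)$ is an $\mathbb{R}^{r \times d_2}$ matrix and can be computed explicitly:

$$\nabla f(V) = \sum_{i=1}^{d_1} \sum_{(j,k) \in \Omega_i} L'\big(Y_{ijk}\, u_i^T (v_j - v_k)\big)\, Y_{ijk}\, \big(u_i e_j^T - u_i e_k^T\big) + \lambda V, \tag{3.7}$$

where $\Omega_i := \{(j, k) \mid (i, j, k) \in \Omega\}$ is the subset of pairs associated with user $i$, and $e_j$ is the indicator vector used to add the vector $u_i$ to the $j$-th column of the output matrix. The first derivative of the $L_2$-hinge loss function (3.3) is

$$L'(a) = 2 \min(a - 1, 0). \tag{3.8}$$

For convenience, we define $g := \mathrm{vec}(\nabla f(V))$ to be the vectorized form of the gradient. One can easily see that computing $g$ naively, by going through all the pairwise comparisons $(j, k)$ and adding up arrays, is time-consuming and has $O(|\Omega| r)$ time complexity, which is the same as that of Collrank [81].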

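Fast computation for gradient. Fortunately, we can reduce the time complexity to $O(|\Omega| + d_1 \bar{d}_2 r)$ by rearranging the computation so that the time is linear in $|\Omega|$ and in $r$ separately, rather than in their product $|\Omega| r$. The method is described below.

First, for each $i$, the first term of (3.7) can be represented by

$$\sum_{(j,k) \in \Omega_i} L'\big(Y_{ijk}\, u_i^T (v_j - v_k)\big)\, Y_{ijk}\, \big(u_i e_j^T - u_i e_k^T\big) = \sum_{j \in \bar{d}_2(i)} t_j\, u_i e_j^T, \tag{3.9}$$

where $\bar{d}_2(i)$ denotes the set of items appearing in the comparisons of user $i$ and $t_j$ is a scalar coefficient accumulated over the pairs in $\Omega_i$ that involve item $j$. If we have all the $t_j$, the contribution of user $i$ to the gradient can be computed in $O(|\bar{d}_2(i)| r)$ time for each $i$. To compute $t_j$, we first compute $u_i^T v_j$ for all $j \in \bar{d}_2(i)$ in $O(|\bar{d}_2(i)| r)$ time, and then go through all the $(j, k)$ pairs, adding the coefficient related to each pair to $t_j$ and $t_k$. Since no vector operations are needed when we go through the pairs, this step takes only $O(|\Omega_i|)$ time. After getting all $t_j$, we can then form $\sum_{j \in \bar{d}_2(i)} t_j u_i e_j^T$ in $O(|\bar{d}_2(i)| r)$ time. Therefore, the overall complexity is reduced to $O(|\Omega| + d_1 \bar{d}_2 r)$. The pseudocode is presented in Algorithm 4.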
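The rearrangement described above can be sketched in Python and checked against the naive accumulation. This is an illustrative sketch under stated assumptions: the loss is the squared hinge with $L'(a) = 2\min(a-1, 0)$, the chain-rule factor $Y_{ijk}$ is kept explicitly, and the names `grad_naive`, `grad_fast` and the `(i, j, k, y)` tuple layout are hypothetical.

```python
import numpy as np
from collections import defaultdict

def hinge_grad(a):
    """First derivative of the squared hinge max(0, 1 - a)^2."""
    return 2.0 * min(a - 1.0, 0.0)

def grad_naive(U, V, triples, lam):
    """O(|Omega| r): one length-r vector update per comparison."""
    G = lam * V                                   # new array, safe to modify
    for i, j, k, y in triples:
        u = U[:, i]
        c = y * hinge_grad(y * (u @ (V[:, j] - V[:, k])))
        G[:, j] += c * u
        G[:, k] -= c * u
    return G

def grad_fast(U, V, triples, lam):
    """O(|Omega| + d1 * d2bar * r): scalar work per pair, vector work per item."""
    G = lam * V
    by_user = defaultdict(list)
    for i, j, k, y in triples:
        by_user[i].append((j, k, y))
    for i, pairs in by_user.items():
        u = U[:, i]
        items = sorted({j for j, _, _ in pairs} | {k for _, k, _ in pairs})
        m = {j: float(u @ V[:, j]) for j in items}    # cached scores u_i^T v_j
        t = dict.fromkeys(items, 0.0)                 # scalar coefficients t_j
        for j, k, y in pairs:                         # no vector operations here
            c = y * hinge_grad(y * (m[j] - m[k]))
            t[j] += c
            t[k] -= c
        for j in items:                               # one axpy per rated item
            G[:, j] += t[j] * u
    return G
```

Both functions return the same $r \times d_2$ matrix; the fast version simply defers all length-$r$ operations until the scalar coefficients have been accumulated.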

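Derivation of Hessian-vector product. Now we derive the Hessian-vector product for $f(V)$.

Algorithm 4. Primal-CR: efficient way to compute the gradient (excerpt)
    for all $j \in \bar{d}_2(i)$ do
        precompute $u_i^T v_j$ and store in a vector $m_i$
    Initialize a zero array $t$ of size $d_2$
    for all $(j, k) \in \Omega_i$ do
        add the coefficient of this pair to $t_j$ and $t_k$
    for all $j \in \bar{d}_2(i)$ do
        add $t_j u_i$ to the $j$-th column of the gradient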

Taking derivative again we can obtain

and the second derivative for L 2 hinge loss function is given by:

Note that if we write the full Hessian $H$ as a $(d_2 r)$-by-$(d_2 r)$ matrix, then $\nabla^2_{j,k} f(V)$ is an $r \times r$ block of $H$, and there are $d_2^2$ such blocks in total. In the CG update for solving $H^{-1} g$, we only need to compute $H \cdot a$ for some $a \in \mathbb{R}^{d_2 r}$. For convenience, we also partition $a$ into $d_2$ blocks, where each subvector $a_j$ has size $r$, so $a = [a_1; \cdots; a_{d_2}]$. Similarly, we use a subscript to denote the subarray $(H \cdot a)_j$ of the array $H \cdot a$, which becomes

where E j is the projection matrix to the j-th block, indicating that we are only adding (H · a) j to the j-th block of matrix, and setting 0 elsewhere.

Algorithm 5. Primal-CR: efficient way to compute Hessian vector product (excerpt)
    for all $j \in \bar{d}_2(i)$ do
        precompute $u_i^T a_j$ and store it in array $b$
    Initialize a zero array $t$ of size $d_2$

As with the gradient, the time complexity of the Hessian-vector product can be reduced to $O(|\Omega| + d_1 \bar{d}_2 r)$ by pre-computing $u_i^T a_j$ and caching the coefficients in the array $t$. The detailed algorithm is given in Algorithm 5.

Note that in Algorithm 5 we can reuse $m$ (the sparse array storing the current predictions), which has been pre-computed in the gradient computation (Algorithm 4) and costs only $O(d_1 \bar{d}_2)$ memory. Even without storing the $m$ matrix, we can recompute $m$ in the loop of line 4 in Algorithm 5, which does not increase the overall computational complexity.
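The same caching idea can be sketched for the Hessian-vector product. The sketch assumes the squared hinge loss, whose generalized second derivative is taken here as 2 when the margin is below 1 and 0 otherwise; since $Y_{ijk}^2 = 1$, each pair contributes $L''(\cdot)\, u_i^T(a_j - a_k)\, u_i (e_j - e_k)^T$. The function names and the `(i, j, k, y)` tuple layout are hypothetical.

```python
import numpy as np
from collections import defaultdict

def hinge_second(a):
    """Generalised second derivative of max(0, 1 - a)^2 (assumed convention at a = 1)."""
    return 2.0 if a < 1.0 else 0.0

def hvp_naive(U, V, A, triples, lam):
    """O(|Omega| r) reference: H applied to the direction A (shape r x d2)."""
    out = lam * A
    for i, j, k, y in triples:
        u = U[:, i]
        c = hinge_second(y * (u @ (V[:, j] - V[:, k]))) * (u @ (A[:, j] - A[:, k]))
        out[:, j] += c * u
        out[:, k] -= c * u
    return out

def hvp_fast(U, V, A, triples, lam):
    """O(|Omega| + d1 * d2bar * r): cache u_i^T v_j and u_i^T a_j per user."""
    out = lam * A
    by_user = defaultdict(list)
    for i, j, k, y in triples:
        by_user[i].append((j, k, y))
    for i, pairs in by_user.items():
        u = U[:, i]
        items = sorted({j for j, _, _ in pairs} | {k for _, k, _ in pairs})
        m = {j: float(u @ V[:, j]) for j in items}   # current predictions
        b = {j: float(u @ A[:, j]) for j in items}   # projections of the direction
        t = dict.fromkeys(items, 0.0)
        for j, k, y in pairs:                        # scalar work only
            c = hinge_second(y * (m[j] - m[k])) * (b[j] - b[k])
            t[j] += c
            t[k] -= c
        for j in items:
            out[:, j] += t[j] * u
    return out
```

Either function can be passed to a conjugate gradient solver as the Hessian-vector callback after flattening the $r \times d_2$ matrices to vectors.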

Fix V and update U. After updating $V$ by the truncated Newton method, we need to fix $V$ and update $U$. The subproblem for $U$ can be written as:

$$\min_{U \in \mathbb{R}^{r \times d_1}} \frac{\lambda}{2}\|U\|_F^2 + \sum_{i=1}^{d_1} \sum_{(j,k) \in \Omega_i} L\big(Y_{ijk}\, u_i^T (v_j - v_k)\big). \tag{3.16}$$

Since $u_i$, the $i$-th column of $U$, is independent of the other columns, (3.16) can be decomposed into $d_1$ independent problems, one for each $u_i$:

$$\min_{u_i \in \mathbb{R}^{r}} \frac{\lambda}{2}\|u_i\|_2^2 + \sum_{(j,k) \in \Omega_i} L\big(Y_{ijk}\, u_i^T (v_j - v_k)\big). \tag{3.17}$$

Eq. (3.17) is equivalent to an $r$-dimensional rankSVM problem. Since $r$ is usually small, these problems are easy to solve. In fact, we can directly apply the efficient rankSVM algorithm proposed in [17] to solve each $r$-dimensional rankSVM problem.

In terms of memory, our algorithm only needs to store matrices of size $d_1 \times r$ and $d_2 \times r$ for the gradient and conjugate gradient computations. The $m$ matrix in Algorithm 4 is not needed, but in practice we find that it speeds up the code by around 25%, and it only takes $d_1 \bar{d}_2 \le |\Omega|$ memory (less than the input size). Therefore, our algorithm is very memory-efficient.

Before going to Primal-CR++, we discuss the time complexity of Primal-CR when the input data is the user-item rating matrix. Assume $\bar{d}_2$ is the average number of rated items per user; then there are $|\Omega| = O(d_1 \bar{d}_2^2)$ pairs, leading to $O(d_1 \bar{d}_2^2 + d_1 \bar{d}_2 r)$ time complexity for Primal-CR. This is much better than the $O(d_1 \bar{d}_2^2 r)$ complexity of existing algorithms.

Primal-CR++ pseudocode (excerpt): do another scan of $j$ from $\bar{d}_2$ to 1 to compute $t^-[\pi(j)]$ for all $j$.

Now we discuss a more realistic scenario, where the input data is a rating matrix $\{R_{ij} \mid (i, j) \in \bar{\Omega}\}$ and $\bar{\Omega}$ is the observed set of user-item ratings. We assume there are only $L$ levels of ratings, so $R_{ij} \in \{1, 2, \ldots, L\}$. Also, we use $\bar{d}_2(i) := \{j \mid (i, j) \in \bar{\Omega}\}$ to denote the rated items for user $i$.

Given this data, the goal is to solve the collaborative ranking problem (3.4) on the pairwise comparisons generated from the ratings. The algorithm of Primal-CR++ is exactly the same as Primal-CR, but we use a smarter procedure to compute the gradient and the Hessian-vector product in near-linear time by exploiting the structure of the input data.

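We first discuss how to speed up the gradient computation of (3.7), where the main computation is to compute (3.9) for each $i$. When the loss function is the $L_2$-hinge loss, we can explicitly write down the coefficients $t_j$ in (3.9) as

$$t_j = \sum_{k \in \bar{d}_2(i),\ R_{ik} \neq R_{ij}} 2\,\big(Y_{ijk}(m_j - m_k) - 1\big)\; I\big[Y_{ijk}(m_j - m_k) \le 1\big]\; Y_{ijk}, \tag{3.18}$$

where $m_j := u_i^T v_j$ and $I[\cdot]$ is an indicator function such that $I[a \le b] = 1$ if $a \le b$, and $I[a \le b] = 0$ otherwise. By splitting the cases of $Y_{ijk} = 1$ and $Y_{ijk} = -1$, we get

$$t_j = \underbrace{\sum_{k:\, R_{ik} < R_{ij}} 2\,(m_j - m_k - 1)\, I[m_k \ge m_j - 1]}_{t_j^-} \;+\; \underbrace{\sum_{k:\, R_{ik} > R_{ij}} 2\,(m_j - m_k + 1)\, I[m_k \le m_j + 1]}_{t_j^+}.$$

To compute $t_j^+$, sort the items by $m_j$ in ascending order and, for each rating level $\ell$, maintain the running sum $s_\ell$ and the running count $c_\ell$ of the scores $m_k$ with $R_{ik} = \ell$ that have been passed so far. Since we scan from left to right, these numbers can be maintained in constant time at each step. Now assume we scan over the numbers $m_1 + 1, m_2 + 1, \ldots$; then at each point we can compute

$$t_j^+ = \sum_{\ell > R_{ij}} 2\,\big(c_\ell\,(m_j + 1) - s_\ell\big)$$

in $O(L)$ time.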

Although we observe that $O(L)$ time is already small in practice (since $L$ is usually smaller than 10), in the following we show that the dependency on $L$ can be reduced from linear to logarithmic by using a simple Fenwick tree [27], F+tree [125], or segment tree. If we store the set $\{s_1, \ldots, s_L\}$ in a Fenwick tree, then each query of the form $\sum_{\ell \ge r} s_\ell$ can be done in $O(\log L)$ time, and since at each step we only need to change one element of the set, the update time is also $O(\log L)$. Note that $t_j^-$ can be computed in the same way by scanning from the largest $m_j$ to the smallest one.
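The sort-and-scan idea can be sketched for a single user as follows. The sketch assumes the squared hinge loss, so that an item $k$ rated higher than $j$ contributes $2(m_j - m_k + 1)$ when $m_k \le m_j + 1$, and an item rated lower contributes $2(m_j - m_k - 1)$ when $m_k \ge m_j - 1$. The `Fenwick` class and the function names are written for this illustration; `t_naive` is a quadratic-time reference used only for checking.

```python
class Fenwick:
    """Binary indexed tree over rating levels 1..n (prefix sums in O(log n))."""
    def __init__(self, n):
        self.n = n
        self.a = [0.0] * (n + 1)
    def add(self, i, x):
        while i <= self.n:
            self.a[i] += x
            i += i & (-i)
    def prefix(self, i):               # sum over levels 1..i
        s = 0.0
        while i > 0:
            s += self.a[i]
            i -= i & (-i)
        return s

def t_naive(m, R):
    """O(d^2) reference: accumulate the coefficient t_j pair by pair."""
    d = len(m)
    t = [0.0] * d
    for j in range(d):
        for k in range(d):
            if R[j] > R[k]:
                t[j] += 2.0 * min(m[j] - m[k] - 1.0, 0.0)
            elif R[j] < R[k]:
                t[j] += 2.0 * max(m[j] - m[k] + 1.0, 0.0)
    return t

def t_scan(m, R, L):
    """O(d log d + d log L): sort once, then two monotone scans."""
    d = len(m)
    order = sorted(range(d), key=lambda j: m[j])
    t = [0.0] * d
    # Ascending scan: items rated higher than j with m_k <= m_j + 1.
    cnt, tot, p = Fenwick(L), Fenwick(L), 0
    for j in order:
        while p < d and m[order[p]] <= m[j] + 1.0:
            k = order[p]
            cnt.add(R[k], 1.0)
            tot.add(R[k], m[k])
            p += 1
        c = cnt.prefix(L) - cnt.prefix(R[j])      # count over levels > R_j
        s = tot.prefix(L) - tot.prefix(R[j])      # score sum over levels > R_j
        t[j] += 2.0 * (c * (m[j] + 1.0) - s)
    # Descending scan: items rated lower than j with m_k >= m_j - 1.
    cnt, tot, p = Fenwick(L), Fenwick(L), d - 1
    for j in reversed(order):
        while p >= 0 and m[order[p]] >= m[j] - 1.0:
            k = order[p]
            cnt.add(R[k], 1.0)
            tot.add(R[k], m[k])
            p -= 1
        c = cnt.prefix(R[j] - 1)                  # count over levels < R_j
        s = tot.prefix(R[j] - 1)
        t[j] += 2.0 * (c * (m[j] - 1.0) - s)
    return t
```

Because the thresholds $m_j + 1$ and $m_j - 1$ move monotonically along each scan, every item is inserted into the trees exactly once per scan, which is what keeps the scans near-linear.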

To sum up, the algorithm first computes all $m_j$ in $O(\bar{d}_2 r)$ time, then sorts these numbers in $O(\bar{d}_2 \log \bar{d}_2)$ time, and then computes $t_j$ for all $j$ using two linear scans in $O(\bar{d}_2 \log L)$ time. Here $\log L$ is dominated by $\log \bar{d}_2$, since $L$ can be taken to be the number of unique rating levels in the current set $\bar{d}_2(i)$. Therefore, after computing this for all users $i = 1, \ldots, d_1$, the time complexity for computing the gradient is $O(d_1 \bar{d}_2 (r + \log \bar{d}_2))$.

A similar procedure can also be used for computing the Hessian-vector product, and the computation for updating $U$ with $V$ fixed is simpler, since the problem decomposes into $d_1$ independent subproblems, one for each $u_i$. Compared with classical matrix factorization, where both ALS and SGD require $O(|\bar{\Omega}| r)$ time per iteration [61], our algorithm has almost the same complexity, since $\log \bar{d}_2$ is usually smaller than $r$ (typically $r = 100$). Also, since all the temporary memory used when processing user $i$ can be released immediately, the memory cost is still the same as for Primal-CR.

Updating $U$ while fixing $V$ can be parallelized easily, because each column of $U$ is independent and we can solve the $d_1$ independent subproblems at the same time. On the other side, updating $V$ while fixing $U$ can also be parallelized by parallelizing the "computing $g$" part and the "computing $Ha$" part, respectively. We implemented the algorithm using parallel computing techniques in Julia, computing $g$ and $Ha$ in a distributed fashion and summing the partial results.

In this section, we test the performance of our proposed algorithms Primal-CR and Primal-CR++ on real world datasets, and compare with existing methods. All experiments are conducted on the UC Davis Illidan server with an Intel Xeon E5-2640 2.40GHz CPU and 64G RAM. We compare the following methods:

• Primal-CR and Primal-CR++: our proposed methods implemented in Julia.

• Collrank: the collaborative ranking algorithm proposed in [81]. We use the C++ code released by the authors, who parallelized their algorithm using OpenMP.

• Cofirank: the classical collaborative ranking algorithm proposed in [114] . We use the C++ code released by the authors.

• MF: the classical matrix factorization model in (3.1) solved by SGD [61] .

We used three data sets (MovieLens1m, Movielens10m, and Netflix) to compare these algorithms. The dataset statistics are summarized in Table 3.1. The regularization parameter $\lambda$ used for each dataset is chosen using a randomly sampled validation set. For the pairwise algorithms, we convert the ratings into pairwise comparisons by saying that item $j$ is preferred over item $k$ by user $i$ if user $i$ gives a higher rating to item $j$ than to item $k$; no pair is generated between two items that have the same rating.

We compare the algorithms in the following three different ways:

• Objective function: since Collrank, Primal-CR, Primal-CR++ have the same objective function, we can compare the convergence speed in terms of the objective function (3.4) with squared hinge loss.

• Predicted pairwise error: the proportion of pairwise preference comparisons in the testing data that we predicted incorrectly:

$$\text{pairwise error} = \frac{1}{|T|} \sum_{(i,j,k) \in T} \mathbb{1}\big(Y_{ijk} \neq \mathrm{sgn}(X_{ij} - X_{ik})\big), \tag{3.20}$$

where $T$ represents the test data set and $|T|$ denotes the size of the test data set.

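• NDCG@k: a standard performance measure of ranking, defined as:

$$\text{NDCG@}k = \frac{1}{d_1} \sum_{i=1}^{d_1} \frac{\text{DCG@}k(i, \pi_i)}{\text{DCG@}k(i, \pi_i^*)},$$

where $i$ represents the $i$-th user and

$$\text{DCG@}k(i, \pi_i) = \sum_{l=1}^{k} \frac{2^{M_{i \pi_i(l)}} - 1}{\log_2(l + 1)}.$$

In the DCG definition, $\pi_i(l)$ represents the index of the $l$-th ranked item for user $i$ in the test data based on the generated score matrix $X = U^T V$, $M$ is the rating matrix, and $M_{ij}$ is the rating given to item $j$ by user $i$. $\pi_i^*$ is the ordering provided by the underlying ground truth of the ratings.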
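For a single user, the two evaluation measures can be sketched as below. The sketch assumes the common exponential gain $2^{\text{rating}} - 1$ with a $\log_2(l + 1)$ discount for DCG, and counts a test pair as an error when the higher-rated item does not receive a strictly higher score; the function names are hypothetical. Averaging `ndcg_at_k` over users gives the dataset-level NDCG@k.

```python
import numpy as np

def dcg_at_k(ratings_in_rank_order, k):
    """DCG@k for ratings listed in ranked order (best-ranked item first)."""
    rel = np.asarray(ratings_in_rank_order, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))     # log2(l + 1), l = 1..k
    return float(np.sum((2.0 ** rel - 1.0) / discounts))

def ndcg_at_k(scores, ratings, k):
    """NDCG@k of one user's test items: DCG of predicted order / DCG of ideal order."""
    scores = np.asarray(scores, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    pred_order = np.argsort(-scores)
    ideal_order = np.argsort(-ratings)
    ideal = dcg_at_k(ratings[ideal_order], k)
    return dcg_at_k(ratings[pred_order], k) / ideal if ideal > 0 else 0.0

def pairwise_error(scores, ratings):
    """Fraction of test pairs with different ratings that are ordered incorrectly."""
    wrong, total = 0, 0
    for j in range(len(ratings)):
        for k in range(len(ratings)):
            if ratings[j] > ratings[k]:
                total += 1
                wrong += int(scores[j] <= scores[k])
    return wrong / total if total else 0.0
```

A perfectly ordered list gives an NDCG of 1 and a pairwise error of 0, while reversing the order of three items rated 3, 2, 1 gives a pairwise error of 1.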

3.5.1 Compare single thread versions using the same subsamples

Since Collrank cannot scale to the full datasets of Movielens10m and Netflix, we sub-sample the data using the same approach as in their paper [81] and compare all the methods using the smaller training sets. More specifically, for each data set, we subsampled $N$ ratings per user as training data and used the rest of the ratings as test data. For this subsampled data, we discard users with fewer than $N + 10$ ratings, since we need at least 10 test ratings to compute NDCG@10.

As shown in Figures 3.1, 3.2, and 3.3, both Primal-CR and Primal-CR++ perform considerably better than the existing Collrank algorithm. As the data size increases, the performance gap becomes larger. For the Netflix data with $N = 200$, the speedup over Collrank is more than 10 times.

For Cofirank, we observe that it is even slower than Collrank, which confirms the experiments conducted in [81]. Furthermore, Cofirank cannot scale to larger datasets, so we omit its results there.

We also include the classical matrix factorization algorithm in the NDCG comparisons. As shown in our complexity analysis, our proposed algorithms are competitive with MF in terms of speed, and MF is much faster than the other collaborative ranking algorithms. Also, we observe that MF converges to a slightly worse solution on the MovieLens10m and Netflix datasets, and to a much worse solution on MovieLens1m. The reason is that MF minimizes a simple mean squared error, while our algorithms minimize a ranking loss.

Based on the experimental results, our algorithm Primal-CR++ should be able to replace MF in many real world recommender systems. 

Since Collrank can be implemented in a parallel fashion, we also implemented a parallel version of our algorithm in Julia. We want to show that our algorithm scales up well and is still much faster than Collrank in the multi-core shared-memory setting.

Using our algorithm, we have the ability to solve the full Netflix problem, so a natural question to ask is: does using more training data help us predict and recommend better? The answer is yes! We conduct the following experiment to verify this: for all users with more than 20 ratings, we randomly choose 10 ratings as test data, and out of the remaining ratings we randomly choose up to $C$ ratings per user as training data. As one can see in Figure 3.6, for the same test data, more training data leads to better prediction performance in terms of pairwise error and NDCG. Using all available ratings ($C = d_2$) gives the lowest pairwise error and highest NDCG@10, using up to 200 ratings per user ($C = 200$) gives the second lowest pairwise error and second highest NDCG@10, and using up to 100 ratings per user ($C = 100$) gives the highest pairwise error and lowest NDCG@10.

A similar phenomenon is observed for the Netflix data in Figure 3.7. The Collrank code does not work for $C = 200$ or $C = d_2$, and even for $C = 100$ it takes more than 20,000 seconds to converge, while our Primal-CR++ takes less than 5,000 seconds on the full Netflix data. The speedup of our algorithm will be even larger for a larger $C$ or larger data sizes $d_1$ and $d_2$. When we tried to create an input file without subsampling for Collrank, the resulting input file was 344GB and Collrank failed with the error message "Segmentation Fault". We also tried $C = 200$ and got the same error message. It is possible to implement the Collrank algorithm by working directly on the rating data, but the time complexity remains the same, so it is clear that our proposed Primal-CR and Primal-CR++ algorithms are much faster.

To the best of our knowledge, our algorithm is the first ranking-based algorithm that can scale to the full Netflix data set using a single core and without sub-sampling.

We considered the collaborative ranking problem setting in which a low-rank matrix is fitted to data in the form of pairwise comparisons or numerical ratings. We proposed new optimization algorithms, Primal-CR and Primal-CR++, whose time complexity is much better than that of all the existing approaches. We showed that our algorithms are much faster than state-of-the-art collaborative ranking algorithms on real data sets (MovieLens1m, Movielens10m and Netflix) using the same subsampling scheme, and moreover our algorithm is the only one that can scale to the full Movielens10m and Netflix data. We observed that our algorithm is about as efficient as matrix factorization, while achieving better NDCG since we minimize a ranking loss. As a result, we expect our algorithm to be able to replace matrix factorization in many real applications.

We study a novel approach to collaborative ranking (the personalized ranking of items for users based on their observed preferences) through the use of listwise losses, which depend only on the observed rankings of items by users. We propose the SQL-Rank algorithm, which can handle ties and missingness, incorporates both explicit ratings and more implicit feedback, provides personalized rankings, and is based on the relative rankings of items. To better understand the proposed contributions, let us begin with a brief history of the topic.

Recommendation systems, found in many modern web applications, movie streaming services, and social media, rank new items for users and are judged based on user engagement (implicit feedback) and ratings (explicit feedback) of the recommended items.

A high-quality recommendation system must understand the popularity of an item and infer a user's specific preferences with limited data. Collaborative filtering, introduced in [44], refers to the use of an entire community's preferences to better predict the preferences of an individual (see [95] for an overview). In systems where users provide ratings of items, collaborative filtering can be approached as a point-wise prediction task, in which we attempt to predict the unobserved ratings [80]. Low-rank methods, in which the rating distribution is parametrized by a low-rank matrix (meaning that there are only a few latent factors), provide a powerful framework for estimating ratings [59, 73]. There are several issues with this approach. One issue is that the feedback may not be representative of the unobserved entries due to a sampling bias, an effect that is prevalent when the items are only 'liked' or the feedback is implicit because it is inferred from user engagement.

Augmenting techniques like weighting were introduced to the matrix factorization objective to overcome this problem [48, 49]. Many other techniques have also been introduced [55, 112, 120].

Another methodology worth noting is the CofiRank algorithm of [113] which minimizes a convex surrogate of the normalized discounted cumulative gain (NDCG). The pointwise framework has other flaws, chief among them is that in recommendation systems we are not interested in predicting ratings or engagement, but rather we must rank the items.

Ranking is an inherently relative exercise. Because users have different standards for ratings, it is often desirable for ranking algorithms to rely only on relative rankings and not absolute ratings. A ranking loss is one that only considers a user's relative preferences between items, and ignores the absolute value of the ratings entirely, thus deviating from the pointwise framework. Ranking losses can be characterized as pairwise and listwise.

A pairwise method decomposes the objective into pairs of items $j, k$ for a user $i$, and effectively asks 'did we successfully predict the comparison between $j$ and $k$ for user $i$?'. The comparison is a binary response (user $i$ liked $j$ more than or less than $k$), with possible missing values in the event of ties or unobserved preferences. Because the pairwise model casts the problem in the classification framework, tools like support vector machines have been used to learn rankings; [54] introduces rankSVM, and efficient solvers can be found in [17]. Much of the existing literature focuses on learning a single ranking for all users, which we will call simple ranking [2, 28, 77]. This work will focus on the personalized ranking setting, in which the ranking is dependent on the user.

Pairwise methods for personalized ranking have seen great advances in recent years, with the AltSVM algorithm of [81], Bayesian personalized ranking (BPR) of [91], and the near linear-time algorithm of [115]. Nevertheless, pairwise algorithms implicitly assume that the item comparisons are independent, because the objective can be decomposed into a sum over pairwise comparisons in which each comparison has equal weight. Listwise losses instead assign a loss, via a generative model, to the entire observed ranking, which can be thought of as a permutation of the $m$ items, instead of to each comparison independently. The listwise permutation model, introduced in [15], can be thought of as a weighted urn model, where items correspond to balls in an urn and are sequentially plucked from the urn with probability proportional to $\phi(X_{ij})$, where $X_{ij}$ is the latent score for user $i$ and item $j$ and $\phi$ is some non-negative function. They proposed to learn rankings by optimizing a cross entropy between the probability of $k$ items being at the top of the ranking and the observed ranking, which they combine with a neural network, resulting in the ListNet algorithm. [100] applies this idea to collaborative ranking, but uses only the top-1 probability because of the computational complexity of using top-$k$ in this setting. This was extended in [50] to incorporate neighborhood information. [121] instead proposes a maximum likelihood framework that uses the permutation probability directly, which enjoyed some empirical success.

Very little is understood about the theoretical performance of listwise methods. [15] demonstrates that the listwise loss has some basic desirable properties such as monotonicity, i.e. increasing the score of an item will tend to make it more highly ranked. [65] studies the generalizability of several listwise losses using the local Rademacher complexity, and finds that the excess risk can be bounded by a $1/\sqrt{n}$ term (recall that $n$ is the number of users). Two main issues with this work are that no dependence on the number of items is given (it seems these results do not hold when $m$ is increasing) and that the scores are not personalized to specific users, meaning that each user is assumed to be an independent and identically distributed observation. A simple open problem is: can we consistently learn preferences from a single user's data if we are given item features and we assume a simple parametric model ($n = 1$, $m \to \infty$)?

We can summarize the shortcomings of the existing work: current listwise methods for collaborative ranking rely on the top-1 loss, algorithms involving the full permutation probability are computationally expensive, little is known about the theoretical performance of listwise methods, and few frameworks are flexible enough to handle explicit and implicit data with ties and missingness. This chapter addresses each of these in turn by proposing and analyzing the SQL-Rank algorithm.

• We propose the SQL-Rank method, which is motivated by the permutation probability, and has advantages over the previous listwise method using cross entropy loss.

• We provide an $O(\text{iter} \cdot |\Omega| r)$ linear algorithm based on stochastic gradient descent, where $\Omega$ is the set of observed ratings and $r$ is the rank.

• The methodology can incorporate both implicit and explicit feedback, and can gracefully handle ties and missing data.

• We provide a theoretical framework for analyzing listwise methods, and apply this to the simple ranking and personalized ranking settings, highlighting the dependence on the number of users and items.

The permutation probability [15] is a generative model for the ranking, parametrized by latent scores. First assume there exists a ranking function that assigns scores to all the items. Say we have m items; the assigned scores can then be represented as a vector $s = (s_1, s_2, \dots, s_m)$. Denote a particular permutation (or ordering) of the m items as $\pi$, which is a random variable taking values in the set of all possible permutations $S_m$ (the symmetric group on m elements). $\pi_1$ denotes the index of the highest ranked item and $\pi_m$ the index of the lowest ranked. The probability of obtaining $\pi$ is defined to be

$$P_s(\pi) := \prod_{j=1}^{m} \frac{\phi(s_{\pi_j})}{\sum_{l=j}^{m} \phi(s_{\pi_l})}, \tag{4.2}$$

where φ(.) is an increasing and strictly positive function. An interpretation of this model is that, at each step, an item is drawn without replacement with probability proportional to $\phi(s_i)$ for item i. One can easily show that $P_s(\pi)$ is a valid probability distribution,

i.e. $\sum_{\pi \in S_m} P_s(\pi) = 1$ and $P_s(\pi) > 0$ for all $\pi$. Furthermore, this definition of permutation probability enjoys several favorable properties (see [15]). For any permutation $\pi$, if you swap two elements ranked at $i < j$, generating the permutation $\pi'$ ($\pi'_i = \pi_j$, $\pi'_j = \pi_i$, $\pi'_k = \pi_k$ for $k \neq i, j$), and if $s_{\pi_i} > s_{\pi_j}$, then $P_s(\pi) > P_s(\pi')$. Also, if a permutation $\pi$ satisfies $s_{\pi_i} > s_{\pi_{i+1}}$ for all i, then we have $\pi = \arg\max_{\pi' \in S_m} P_s(\pi')$. Both of these properties can be summarized: larger scores will tend to be ranked more highly than lower scores. These properties are required for the negative log-likelihood to be considered sound for ranking [121].

In recommendation systems, the top ranked items can be more impactful for the performance. In order to focus on the top k ranked items, we can compute the partial-ranking marginal probability.

It is a common occurrence that only a proportion of the m items are ranked, and in that case we will allow $\bar m \le m$ to be the number of observed rankings (we assume that $\pi_1, \dots, \pi_{\bar m}$ are the complete list of ranked items). When k = 1, the first summation vanishes and the top-1 probability can be calculated straightforwardly, which is why k = 1 is widely used in previous listwise approaches for collaborative ranking. Counter-intuitively, we demonstrate that using a larger k tends to improve the ranking performance.
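As an illustration of the urn model described above, the following minimal sketch evaluates the log-probability of the first k draws with a single reversed cumulative sum. It assumes the top-k probability is the product of the first k sequential draw probabilities; the function name `log_perm_prob` and the default `phi=np.exp` are illustrative choices, not the author's implementation.

```python
import numpy as np

def log_perm_prob(scores, perm, k=None, phi=np.exp):
    """Log-probability of observing the first k entries of `perm` under the
    urn model: items are drawn without replacement with probability
    proportional to phi(score). `phi` is an assumed positive, increasing map."""
    w = phi(np.asarray(scores, dtype=float)[np.asarray(perm)])  # weights in ranked order
    m_bar = len(w)
    k = m_bar if k is None else min(k, m_bar)
    # tail[j] = sum of the weights still in the urn before the j-th draw
    tail = np.cumsum(w[::-1])[::-1]
    return float(np.sum(np.log(w[:k]) - np.log(tail[:k])))

# Usage: full-list and top-1 log-probabilities for three items.
scores = [0.5, -1.0, 2.0]
full = log_perm_prob(scores, (2, 0, 1))
top1 = log_perm_prob(scores, (2, 0, 1), k=1)
```

The cost is one pass over the ranked list, regardless of k.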

We see that computing the likelihood loss is linear in the number of ranked items, which is in contrast to the cross-entropy loss used in [15] , which takes exponential time in k. The cross-entropy loss is also not sound, i.e. it can rank worse scoring permutations more highly, but the negative log-likelihood is sound. We will discuss how we can deal with ties in the following subsection, namely, when the ranking is derived from ratings and multiple items receive the same rating, then there is ambiguity as to the order of the tied items. This is a common occurrence when the data is implicit, namely the output is whether the user engaged with the item or not, yet did not provide explicit feedback.

Because the output is binary, the cross-entropy loss (which is based on top-k probability with k very small) will perform very poorly because there will be many ties for the top ranked items. To this end, we propose a collaborative ranking algorithm using the listwise likelihood that can accommodate ties and missingness, which we call Stochastic Queuing Listwise Ranking, or SQL-Rank. 

The goal of collaborative ranking is to predict a personalized score $X_{ij}$ that reflects the preference level of user i towards item j, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. It is reasonable to assume the matrix $X \in \mathbb{R}^{n \times m}$ to be low rank because there are only a small number of latent factors contributing to users' preferences. The input data is given in the form of "user i gives item j a relevance score $R_{ij}$". Note that for simplicity we assume all the users have the same number $\bar m$ of ratings, but this can be easily generalized to the non-uniform case by replacing $\bar m$ with $m_i$ (the number of ratings for user i).

With our scores X and our ratings R, we can specify our collaborative ranking model using the permutation probability (4.2). Let $\Pi_i$ be a ranking permutation of items for user i (extracted from R); we can stack $\Pi_1, \dots, \Pi_n$, row by row, to get the permutation matrix $\Pi \in \mathbb{R}^{n \times m}$. Assuming users are independent of each other, the probability of observing a particular $\Pi$ given the scoring matrix X can be written as

$$P(\Pi \mid X) = \prod_{i=1}^{n} P_{X_i}(\Pi_i),$$

where $P_{X_i}$ is the permutation probability (4.2) evaluated with the i-th row of X as the score vector.

We will assume that log φ(x) = 1/(1 + exp(−x)) is the sigmoid function. This has the advantage of bounding the resulting weights, φ(X ij ), and maintaining their positivity without adding additional constraints.

Typical rating data will contain many ties within each row. In such cases, the permutation Π is no longer unique: there is a set of permutations that coincide with the ratings, because for any candidate Π we can arbitrarily shuffle the ordering of items with the same relevance scores to generate a new candidate matrix Π' which is still valid.

To learn the scoring matrix X, we can naturally solve the following maximum likelihood estimator with low-rank constraint:

where $\mathcal{X}$ is the structural constraint of the scoring matrix. To enforce low-rankness, we use the nuclear norm regularization $\mathcal{X} = \{X : \|X\|_* \le r\}$.

This upper bound is much easier to optimize and can be solved using Stochastic Gradient Descent (SGD).

Next we discuss how to apply our model in the explicit and implicit feedback settings. In the explicit feedback setting, it is assumed that the matrix R is partially observed and the observed entries are explicit ratings in a range (e.g., 1 to 5). We will show in the experiments that $k = \bar m$ (using the full list) leads to the best results. [50] also observed that increasing k is useful for their cross-entropy loss, but they were not able to increase k since their model has time complexity exponential in k.

In the implicit feedback setting each element $R_{ij}$ is either 1 or 0, where 1 means a positive action (e.g., click or like) and 0 means no action is observed. Directly solving (4.5) will be expensive since $\bar m = m$ and the computation will involve all the mn elements at each iteration. Moreover, the 0's in the matrix could mean either a lower relevance score or a missing entry, and thus should contribute less to the objective function. Therefore, we adopt the idea of negative sampling [71] in our listwise formulation. For each user (row of R), assume there are $\tilde m$ 1's; we then sample $\rho \tilde m$ unobserved entries uniformly from the same row and append them to the back of the list. This then becomes the problem with $\bar m = (1 + \rho)\tilde m$, and we use the same algorithm as in the explicit feedback setting to conduct updates. We then repeat the sampling process at the end of each iteration, so each update is based on a different set of 0's.
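The negative sampling step for one user can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `build_implicit_list` is hypothetical, and sampling without replacement is an assumption, since the text only says that unobserved entries are sampled uniformly from the same row.

```python
import numpy as np

def build_implicit_list(pos_items, n_items, rho, rng):
    """Return one user's list: the observed 1's first, followed by
    rho * (number of 1's) unobserved items sampled uniformly."""
    pos = np.asarray(pos_items)
    mask = np.ones(n_items, dtype=bool)
    mask[pos] = False                          # mark observed items as unavailable
    unobserved = np.flatnonzero(mask)
    n_neg = min(int(rho * len(pos)), len(unobserved))
    # Assumption: sample without replacement so that no item appears twice.
    neg = rng.choice(unobserved, size=n_neg, replace=False)
    return np.concatenate([pos, neg])

# Usage: three observed items, ratio rho = 3, twenty items in total.
rng = np.random.default_rng(0)
lst = build_implicit_list([3, 7, 9], n_items=20, rho=3, rng=rng)
```

Calling the function again with the same generator draws a fresh set of 0's, which mirrors the re-sampling at the end of each iteration.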

Despite the advantage of the objective function in equation (4.5) being convex, it is still not feasible for large-scale problems, since the scoring matrix $X \in \mathbb{R}^{n \times m}$ leads to high computational and memory cost. We follow a common trick and transform (4.5) into a non-convex form by substituting $X = U^T V$ with $U \in \mathbb{R}^{r \times n}$, $V \in \mathbb{R}^{r \times m}$, so that the objective becomes a function f(U, V) of the two factors, where $u_i$ and $v_j$ are columns of U and V respectively. We apply stochastic gradient descent to solve this problem. At each step, we choose a permutation matrix $\Pi \in S(R, \Omega)$ using the stochastic queuing process (Algorithm 8) and then update U, V by ∇f(U, V). For example, the gradient with respect to V is (g = log φ is the sigmoid function),

where $\Omega_j$ denotes the set of users that have rated item j and $\text{rank}_i(j)$ is a function that gives the rank of item j for user i. Because g is the sigmoid function, $g' = g \cdot (1 - g)$.

The gradient with respect to U can be derived similarly.

As one can see, a naive way to compute the gradient of f requires $O(n \bar m^2 r)$ time, which is very slow even for one iteration. However, we show in Algorithm 12 (in the appendix) that the computation can be rearranged so that $\nabla_V f(U, V)$ can be computed in $O(n \bar m r)$ time, which makes SQL-Rank a linear-time algorithm (with the same per-iteration complexity as classical matrix factorization).
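The rearrangement can be illustrated for a single user. The sketch below is not Algorithm 12; it assumes the per-user negative log-likelihood $-\sum_j [g(x_j) - \log \sum_{l \ge j} \phi(x_l)]$ with $\phi = \exp(g)$ and g the sigmoid, where x holds that user's latent scores already in ranked order. The name `user_loss_and_grad` is hypothetical. Two cumulative sums give the gradient with respect to every score in one pass; by the chain rule, the contribution to $v_j$ is then the corresponding entry times $u_i$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def user_loss_and_grad(x):
    """Negative log-likelihood of one ranked list and its gradient with
    respect to the scores x (x[0] is the top-ranked item's score)."""
    g = sigmoid(x)
    phi = np.exp(g)                        # phi = exp(sigmoid), so log(phi) = g
    tail = np.cumsum(phi[::-1])[::-1]      # tail[j] = sum_{l >= j} phi[l]
    loss = -np.sum(g - np.log(tail))
    inv_cum = np.cumsum(1.0 / tail)        # inv_cum[j] = sum_{t <= j} 1 / tail[t]
    # d loss / d x[j] = g'(x[j]) * (phi[j] * inv_cum[j] - 1), with g' = g (1 - g)
    grad = g * (1.0 - g) * (phi * inv_cum - 1.0)
    return loss, grad

# Usage: loss and gradient for a list of four ranked items.
loss, grad = user_loss_and_grad(np.array([0.3, -1.2, 0.8, 0.1]))
```

A naive implementation would recompute each denominator from scratch, which is quadratic in the list length; sharing the cumulative sums removes that factor.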

In this section, we compare our proposed algorithm (SQL-Rank) with other state-of-the-art algorithms on real world datasets. Note that our algorithm works for both implicit feedback and explicit feedback settings. In the implicit feedback setting, all the ratings are 0 or 1; in the explicit feedback setting, explicit ratings (e.g., 1 to 5) are given but only to a subset of user-item pairs. Since many real world recommendation systems follow the implicit feedback setting (e.g., purchases, clicks, or check-ins), we will first compare SQL-Rank on implicit feedback datasets and show it outperforms state-of-the-art algorithms. Then we will verify that our algorithm also performs well on explicit feedback problems. All experiments are conducted on a server with an Intel Xeon E5-2640 2.40GHz CPU and 64 GB of RAM.

In the implicit feedback setting we compare the following methods:

• SQL-Rank: our proposed algorithm implemented in Julia (https://github.com/wuliwei9278/SQL-Rank).

• Weighted-MF: the weighted matrix factorization algorithm by putting different weights on 0 and 1's [48, 49] .

• BPR: the Bayesian personalized ranking method motivated by MLE [91] . For both Weighted-MF and BPR, we use the C++ code by Quora 2 .

Note that other collaborative ranking methods such as Primal-CR++ [115] and List-MF [100] do not work for implicit feedback data, and we will compare with them later in the explicit feedback experiments. For the performance metric, we use precision@k for k = 1, 5, 10, defined by

$$\text{precision@}k = \frac{\sum_{i=1}^{n} \left|\{1 \le l \le k : R_{i\Pi_{il}} = 1\}\right|}{n \cdot k},$$

where R is the rating matrix and Π il gives the index of the l-th ranked item for user i among all the items not rated by user i in the training set.

We use rank r = 100 and tune regularization parameters for all three algorithms using a randomly sampled validation set. For Weighted-MF, we also tune the confidence weights on unobserved data. For BPR and SQL-Rank, we fix the ratio of subsampled unobserved 0's versus observed 1's to be 3 : 1, which gives the best performance for both BPR and SQL-Rank in practice.

We experiment on the following four datasets. Note that the original data of Movielens1m, Amazon and Yahoo-music are ratings from 1 to 5, so we follow the procedure in [91, 124] to preprocess the data. We transform ratings of 4 and 5 into 1's and the remaining entries (with rating 1, 2, 3 or unknown) into 0's. Also, we remove users with very few 1's in the corresponding row to make sure there are enough 1's for both training and testing. For Amazon, Yahoo-music and Foursquare, we discard users with fewer than 20 ratings, randomly select 10 1's as training and use the rest as testing. Movielens1m has more ratings than the others, so we keep users with more than 60 ratings and randomly sample 50 of them as training.

• Movielens1m: a popular movie recommendation data set with 6,040 users and 3,952 items.

• Yahoo-music: the Yahoo music rating data set, which contains 15,400 users and 1,000 items.

• Foursquare: a location check-in data set. The data set contains 3,112 users and 3,298 venues with 27,149 check-ins. The data set is already in the form of "0/1" so we do not need to do any transformation.

The experimental results are shown in Table 4.1. We find that SQL-Rank outperforms both Weighted-MF and BPR in most cases.

Next we compare the following methods in the explicit feedback setting:

• SQL-Rank: our proposed algorithm implemented in Julia. Note that in the explicit feedback setting our algorithm only considers pairs with explicit ratings.

• List-MF: the listwise algorithm using the cross entropy loss between observed rating and top 1 probability [100] . We use the C++ implementation on github 6 .

• MF: the classical matrix factorization algorithm in [59] utilizing a pointwise loss solved by SGD. We implemented SGD in Julia.

• Primal-CR++: the recently proposed pairwise algorithm in [115] . We use the Julia implementation released by the authors 7 .

Experiments are conducted on the Movielens1m and Yahoo-music datasets. We perform the same procedure as in the implicit feedback setting, except that we do not need to convert the ratings into "0/1".

We measure the performance in the following two ways:

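• NDCG@k: defined as

$$\text{NDCG@}k = \frac{1}{n} \sum_{i=1}^{n} \frac{\text{DCG@}k(i, \Pi_i)}{\text{DCG@}k(i, \Pi_i^*)},$$

where i represents the i-th user and

$$\text{DCG@}k(i, \Pi_i) = \sum_{l=1}^{k} \frac{2^{R_{i\Pi_{il}}} - 1}{\log_2(l + 1)}.$$

In the DCG definition, $\Pi_{il}$ represents the index of the l-th ranked item for user i in the test data based on the learned score matrix X. R is the rating matrix and $R_{ij}$ is the rating given to item j by user i. $\Pi_i^*$ is the ordering provided by the ground truth rating.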

• Precision@k: defined as a fraction of relevant items among the top k recommended items: here we consider items with ratings assigned as 4 or 5 as relevant. R ij follows the same definitions above but unlike before Π il gives the index of the l-th ranked item for user i among all the items that are not rated by user i in the training set (including both rated test items and unobserved items).
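Both metrics can be computed per user as in the following sketch. It assumes the common DCG convention with gain $2^{R} - 1$ and a $\log_2$ position discount; the function names are hypothetical, and averaging over users is left to the caller.

```python
import numpy as np

def precision_at_k(ranked_items, relevant, k):
    """Fraction of the top-k recommended items that are in the relevant set."""
    return sum(1 for j in ranked_items[:k] if j in relevant) / k

def dcg_at_k(ratings_in_rank_order, k):
    """Discounted cumulative gain of the first k ratings (assumed 2^R - 1 gain)."""
    r = np.asarray(ratings_in_rank_order, dtype=float)[:k]
    discounts = np.log2(np.arange(2, len(r) + 2))   # log2(l + 1) for l = 1, 2, ...
    return float(np.sum((2.0 ** r - 1.0) / discounts))

def ndcg_at_k(pred_scores, true_ratings, k):
    """DCG of the predicted ordering divided by the DCG of the ideal ordering."""
    pred_scores = np.asarray(pred_scores, dtype=float)
    true_ratings = np.asarray(true_ratings, dtype=float)
    order = np.argsort(-pred_scores, kind="stable")  # highest predicted score first
    ideal = np.sort(true_ratings)[::-1]
    denom = dcg_at_k(ideal, k)
    return dcg_at_k(true_ratings[order], k) / denom if denom > 0 else 0.0

# Usage: one user with four test items rated 5, 3, 4, 1.
ndcg = ndcg_at_k([0.9, 0.2, 0.7, 0.1], [5, 3, 4, 1], k=3)
prec = precision_at_k([0, 2, 1, 3], relevant={0, 2}, k=3)
```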

As shown in Table 4 

To illustrate the training speed of our algorithm, we plot precision@1 versus training time for the Movielens1m dataset and the Foursquare dataset.

One important innovation in our SQL-Rank algorithm is the Stochastic Queuing (SQ) Process for handling ties. To illustrate the effectiveness of the SQ process, we compare SQL-Rank with and without it. As shown in Table 4.3 and Figure 8.3 (in the appendix), the performance gain from SQ in terms of precision is substantial (more than 10%) on the Movielens1m dataset. This verifies the claim that our way of handling ties and missing data is very effective and substantially improves the ranking results.
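Algorithm 8 is not reproduced in this section, so the following is only a sketch of the tie-handling idea, assuming that the SQ process amounts to ordering items by rating and breaking ties uniformly at random on every draw. The helper name `sample_tied_permutation` is hypothetical.

```python
import numpy as np

def sample_tied_permutation(item_ids, ratings, rng):
    """Order items by decreasing rating, shuffling items with equal ratings.
    Each call returns one permutation that is consistent with the ratings."""
    item_ids = np.asarray(item_ids)
    ratings = np.asarray(ratings)
    tiebreak = rng.random(len(ratings))
    # np.lexsort uses the LAST key as the primary key:
    # primary = -ratings (descending rating), secondary = random tie-breaker.
    order = np.lexsort((tiebreak, -ratings))
    return item_ids[order]

# Usage: two items tie at rating 5 and two at rating 3.
rng = np.random.default_rng(1)
perm = sample_tied_permutation([10, 11, 12, 13, 14], [5, 3, 5, 1, 3], rng)
```

Drawing a new permutation at each SGD step exposes the model to different valid orderings of the tied items instead of one fixed arbitrary order.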

Another benefit of our algorithm is that we are able to optimize the top-k probability with much larger k and without much overhead. Previous approaches [50] already pointed out that increasing k leads to better ranking results, but their complexity is exponential in k, so they were not able to use k > 1. To show the effectiveness of using the permutation probability for full lists rather than the top-k probability for top-k partial lists in the likelihood loss, we fix everything else to be the same and only vary k in Equation (4.5).

We obtain the results in Table 4.4 and Figure 8.4 (in the appendix). They show that the larger the k we use, the better the results we get. Therefore, in the final model, we set k to be the maximum number (the length of the observed list).

In this chapter, we propose a listwise approach for collaborative ranking and provide an efficient algorithm to solve it. Our methodology can incorporate both implicit and explicit feedback, and can gracefully handle ties and missing data. In experiments, we demonstrate that our algorithm outperforms existing state-of-the-art methods in terms of top-k recommendation precision. We also provide a theoretical framework for analyzing listwise methods, highlighting the dependence on the number of users and items.

Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers

Recently, embedding representations have been widely used in almost all AI-related fields, from feature maps [63] in computer vision, to word embeddings [71, 83] in natural language processing, to user/item embeddings [49, 73] in recommender systems. Usually, the embeddings are high-dimensional vectors. Take language models for example: in GPT [87] and the BERT-Base model [25], 768-dimensional vectors are used to represent words.

higher dimensions in their unreleased large models. In recommender systems, things are slightly different: the dimension of user/item embeddings is usually set to be reasonably small, 50 or 100, but the number of users and items is on a much bigger scale. In contrast to the size of a word vocabulary, which normally ranges from 50,000 to 150,000, the number of users and items can be in the millions or even billions in large-scale real-world commercial recommender systems [6].

Given the massive number of parameters in modern neural networks with embedding layers, mitigating over-parameterization can play a big role in preventing over-fitting in deep learning. We propose a regularization method, Stochastic Shared Embeddings (SSE), that uses prior information about similarities between embeddings, such as semantically and grammatically related words in natural languages or real-world users who share social relationships. Critically, SSE progresses by stochastically transitioning between embeddings as opposed to a more brute-force regularization such as graph-based Laplacian regularization and ridge regularization. Thus, SSE integrates seamlessly with existing stochastic optimization methods and the resulting regularization is data-driven.

We will begin the paper with the mathematical formulation of the problem, propose SSE, and provide the motivations behind SSE. We provide a theoretical analysis of SSE that can be compared with excess risk bounds based on empirical Rademacher complexity.

We then conduct experiments on a total of 6 tasks, ranging from simple neural networks with one hidden layer in recommender systems to the Transformer and BERT in natural language processing, and find that when used along with widely-used regularization methods such as weight decay and dropout, our proposed methods can further reduce over-fitting, which often leads to more favorable generalization results.

Regularization techniques are used to control model complexity and avoid over-fitting. ℓ2 regularization [46] is the most widely used approach and has been used in many matrix factorization models in recommender systems; ℓ1 regularization [109] is used when a sparse model is preferred. For deep neural networks, it has been shown that ℓp regularizations are often too weak, while dropout [45, 106] is more effective in practice. There are many other regularization techniques, including parameter sharing [30], max-norm regularization [104], gradient clipping [82], etc.

Our proposed SSE-graph is very different from graph Laplacian regularization [13], in which the distances between any two embeddings connected over the graph are directly penalized. Hard parameter sharing uses one embedding to replace all distinct embeddings in the same group, which inevitably introduces a significant bias. Soft parameter sharing [75] is similar to the graph Laplacian, penalizing the ℓ2 distances between any two embeddings. These methods have no dependence on the loss, while the proposed SSE-graph method is data-driven in that the loss influences the effect of regularization. Unlike graph Laplacian regularization and hard and soft parameter sharing, our method is stochastic by nature. This allows our model to enjoy similar advantages as dropout [106].

[Algorithm 9 (SSE-Graph), partially recovered steps: for each embedding $E_l[j^i_l] \in S_i$ do ...; forward and backward pass with the new embeddings; return embeddings $\{E_1, \dots, E_M\}$ and neural network parameters Θ.]

Interestingly, in the original BERT model's pre-training stage [25] , a variant of SSE-SE is already implicitly used for token embeddings but for a different reason. In [25] , the authors masked 15% of words and 10% of the time replaced the [mask] token with a random token. In the next section, we discuss how SSE-SE differs from this heuristic.

Another technique closely related to ours is label smoothing [107], which is widely used in the computer vision community. We find that in the classification setting, if we apply SSE-SE only to the one-hot encodings associated with the output $y_i$, our SSE-SE is closely related to label smoothing, which can be treated as a special case of our proposed method.
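The connection can be made concrete with a short calculation. It is a sketch under two assumptions that are not spelled out at this point in the text: the label index is replaced with probability $p_0$ by one of the other $N - 1$ class indices chosen uniformly, and the loss is the cross-entropy against the predicted class probabilities $q_k$. Averaging the loss over the random replacement of the true label y gives

```latex
\mathbb{E}_{k}\left[-\log q_k\right]
  = -(1 - p_0)\,\log q_y \;-\; \frac{p_0}{N-1} \sum_{k \neq y} \log q_k
  = -\sum_{k=1}^{N} \tilde{y}_k \log q_k,
\qquad
\tilde{y}_k =
\begin{cases}
  1 - p_0, & k = y,\\[2pt]
  \dfrac{p_0}{N-1}, & k \neq y.
\end{cases}
```

This is the cross-entropy against a smoothed target. In the usual label smoothing parametrization, where the target is $(1 - \varepsilon)$ times the one-hot vector plus $\varepsilon / N$ on every class, the two coincide when $\varepsilon = p_0 N / (N - 1)$.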

Throughout this chapter, the network input $x_i$ and label $y_i$ will be encoded into indices. The loss function can be written as a function of the embeddings:

where y i is the label and Θ encompasses all trainable parameters including the embeddings,

The loss function is a mapping from embedding spaces to the reals. For text input, each $E_l[j^i_l]$ is a word embedding vector in the input sentence or document. For recommender systems, usually there are two embedding look-up tables: one for users and one for items [41]. So the objective function, such as mean squared loss or some ranking loss, will comprise both user and item embeddings for each input. We can more succinctly

Suppose that we have access to knowledge graphs [66, 72] over embeddings, and we have a prior belief that two embeddings will share information and that replacing one with the other should not incur a significant change in the loss distribution. For example, if two movies are both comedies and star the same actors, it is very likely that, for the same user, replacing one comedy movie with the other will result in little change in the loss distribution. In stochastic optimization, we can replace the loss gradient for one movie's embedding with that of the other, similar movie's embedding, and this will not significantly bias the gradient if the prior belief is accurate. On the other hand, if this exchange is stochastic, then it will act to smooth the gradient steps in the long run, thus regularizing the gradient updates.

Instead of optimizing objective function R n (Θ) in (5.1), SSE-Graph described in Algorithm 9, Figure 5 and suppress their indices.

In the single embedding table case, M = 1, there are many ways to define the transition probability from j to k. One simple and effective way is to use a random walk (with random restart and self-loops) on a knowledge graph G, i.e. when embedding j is connected with k but not with l, we can set the ratio of p(j, k|Φ) and p(j, l|Φ) to be a constant greater than 1. In more formal notation, we have

$$j \sim k,\ j \not\sim l \;\Longrightarrow\; p(j, k \mid \Phi) / p(j, l \mid \Phi) = \rho, \tag{5.3}$$

where ρ > 1 is a tuning parameter. It is motivated by the fact that embeddings connected with each other in knowledge graphs should bear more resemblance and thus be more likely to be replaced by each other. Also, we let $p(j, j \mid \Phi) = 1 - p_0$, where $p_0$ is called the SSE probability and $1 - p_0$ is the embedding retainment probability. We treat both $p_0$ and ρ as tuning hyper-parameters in experiments. With (5.3) and $\sum_k p(j, k \mid \Phi) = 1$, we can derive transition probabilities between any two embeddings to fill out the transition probability table.
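One way to fill out such a table is sketched below. It is only an illustration that satisfies the three stated constraints (retainment probability $1 - p_0$, neighbor to non-neighbor ratio ρ, rows summing to one) using direct graph connections; it is not claimed to be the author's procedure, and the function name `transition_matrix` is hypothetical.

```python
import numpy as np

def transition_matrix(adj, p0, rho):
    """Build an N x N table P with P[j, j] = 1 - p0, where each neighbor of j
    receives rho times the mass of each non-neighbor, and rows sum to 1."""
    adj = np.asarray(adj, dtype=bool).copy()
    np.fill_diagonal(adj, False)               # self-transitions are handled separately
    n = adj.shape[0]
    deg = adj.sum(axis=1)                      # number of neighbors of each node
    # Solve rho * q * deg + q * (n - 1 - deg) = p0 for the non-neighbor mass q.
    q = p0 / (rho * deg + (n - 1 - deg))
    P = np.where(adj, rho * q[:, None], q[:, None])
    np.fill_diagonal(P, 1.0 - p0)
    return P

# Usage: a path graph 0 - 1 - 2 plus an isolated node 3.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 0],
                [0, 0, 0, 0]])
P = transition_matrix(adj, p0=0.1, rho=4.0)
```

For a node connected to every other node, the formula reduces to a uniform mass of $p_0 / (N - 1)$ on each other index.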

When there are multiple embedding tables, M > 1, we require that the transition from j to k can be thought of as independent transitions from $j_l$ to $k_l$ within embedding table l (and index set $I_l$). Each table may have its own knowledge graph, resulting in its own transition probabilities $p_l(\cdot, \cdot)$. The more general form of the SSE-graph objective is given below: This is equivalent to having a randomized embedding look-up layer, as shown in Figure 5.1.

We can also accommodate sequences of embeddings, which commonly occur in natural language applications, by considering $(j^i_{l,1}, k_{l,1}), \dots, (j^i_{l,n^i_l}, k_{l,n^i_l})$ instead of $(j^i_l, k_l)$ for the l-th embedding table in (5.4), where 1 ≤ l ≤ M and $n^i_l$ is the number of embeddings in table l that are associated with $(x_i, y_i)$. When there is more than one embedding look-up table, we sometimes prefer to use different $p_0$ and ρ for different look-up tables in (5.3) and the SSE probability constraint. For example, in recommender systems, we would use $p_u, \rho_u$ for the user embedding table and $p_i, \rho_i$ for the item embedding table.

We find that SSE with knowledge graphs, i.e., SSE-Graph, can force similar embeddings to cluster when compared to the original neural network without SSE-Graph. In Figure 5.3, one can easily see that more embeddings tend to cluster into 2 black holes after applying SSE-Graph when embeddings are projected into 3D space using PCA. Interestingly, a similar phenomenon occurs when assuming the knowledge graph is a complete graph, which we introduce as SSE-SE below.

One clear limitation of applying SSE-Graph is that not every dataset comes with good-quality knowledge graphs on embeddings. For those cases, we could assume there is a complete graph over all embeddings, so there is a small transition probability between every pair of different embeddings:

$$p(j, k \mid \Phi) = \frac{p_0}{N - 1}, \quad \forall\, 1 \le k \ne j \le N, \tag{5.5}$$

where N is the size of the embedding table. The SGD procedure in Algorithm 9 can still be applied and we call this algorithm SSE-SE (Stochastic Shared Embeddings - Simple and Easy). It is worth noting that SSE-Graph and SSE-SE are applied to embeddings associated with not only the input $x_i$ but also the output $y_i$. Unless there are considerably more embeddings than data points and the model is significantly overfitting, normally $p_0 = 0.01$ gives reasonably good results.
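In practice, SSE-SE only needs to perturb the indices fed to the embedding look-up. The following minimal sketch assumes that each index is independently replaced with probability $p_0$ by a uniformly chosen different index; the function name `sse_se` is hypothetical, and the perturbation would be applied to each mini-batch during training only.

```python
import numpy as np

def sse_se(indices, table_size, p0, rng):
    """Return a copy of `indices` in which each entry is replaced, with
    probability p0, by a different index drawn uniformly from the table."""
    idx = np.asarray(indices).copy()
    swap = rng.random(idx.shape) < p0
    # An offset in [1, table_size - 1] guarantees the new index differs from
    # the old one and is uniform over the remaining table_size - 1 indices.
    offset = rng.integers(1, table_size, size=idx.shape)
    idx[swap] = (idx[swap] + offset[swap]) % table_size
    return idx

# Usage: perturb a mini-batch of item indices before the embedding look-up.
rng = np.random.default_rng(0)
batch = np.array([4, 17, 17, 3, 42])
perturbed = sse_se(batch, table_size=50, p0=0.01, rng=rng)
```

Because the replacement is re-drawn for every mini-batch, the randomization differs at each iteration, in contrast to a one-off augmentation of the input.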

Interestingly, we found that the SSE-SE framework is related to several techniques used in practice. For example, BERT pre-training unintentionally applied a method similar to SSE-SE to the input $x_i$ by replacing the masked word with a random word. This implicitly introduces an SSE layer for the input $x_i$ in Figure 5.1, because embeddings associated with the input $x_i$ are now stochastically mapped according to (5.5). The main difference between this and SSE-SE is that it merely augments the input once, while SSE introduces randomization at every iteration, and we can also accommodate label embeddings. In experimental Section 5.4.4, we will show that SSE-SE improves the original BERT pre-training procedure as well as the fine-tuning procedure.

We explain why SSE can reduce the variance of estimators and thus lead to better generalization performance. For simplicity, we consider the SSE-graph objective (5.2) where there is no transition associated with the label $y_i$, and only the embeddings associated with the input $x_i$ undergo a transition. When this is the case, we can think of the loss as a function of the $x_i$ embedding and the label, $\ell(E[j^i], y_i; \Theta)$. We take this approach because it is more straightforward to compare our resulting theory to existing excess risk bounds.

The SSE objective in the case of only input transitions can be written as,

and there may be some constraint on Θ. Let $\hat\Theta$ denote the minimizer of $S_n$ subject to this constraint. We will show in the subsequent theory that minimizing $S_n$ will get us close to a minimizer of $S(\Theta) = \mathbb{E} S_n(\Theta)$, and that under some conditions this will get us close to the Bayes risk. We will use the standard definitions of empirical and true risk,

Our results depend on the following decomposition of the risk. By optimality of $\hat\Theta$,

The high-level idea behind the following results is that when the SSE protocol reflects the underlying distribution of the data, then the bias term B(Θ) is small, and if the SSE transitions are well mixing then the SSE excess risk E(Θ) will be of smaller order than the standard Rademacher complexity. This will result in a small excess risk.

Theorem 1. Consider SSE-graph with only input transitions.

Transition matrices are contractive and will induce dependencies between the Rademacher random variables, thereby stochastically reducing the supremum. In the case of no label noise, namely that Y|X is a point mass, e(x, y; Θ) = 0 and $\rho_{e,n} = 0$. The use of L as opposed to the losses, ℓ, will also make $\rho_{L,n}$ of smaller order than the standard empirical Rademacher complexity. We demonstrate this with a partial simulation of $\rho_{L,n}$ on the Movielens1m dataset in Figure 8.5 of the Appendix.

Suppose that $0 \le \ell(\cdot, \cdot; \Theta) \le b$ for some b > 0, then

Remark 2. The price for 'smoothing' the Rademacher complexity in Theorem 1 is that SSE may introduce a bias. This will be particularly prominent when the SSE transitions have little to do with the underlying distribution of Y, X. On the other extreme, suppose that p(j, k) is non-zero over a neighborhood $N_j$ of j, and that for data x', y' with encoding $k \in N_j$, (x', y') is identically distributed with $(x_i, y_i)$; then B = 0. In all likelihood, the SSE transition probabilities will not be supported over neighborhoods of iid random pairs, but with a well chosen SSE protocol the neighborhoods contain approximately iid pairs and B is small.

[Table (number not recovered): columns Model, P@1, P@5, P@10, repeated per dataset. We report the metric precision for top k recommendations as P@k.]

We have conducted extensive experiments on 6 tasks, including 3 recommendation tasks (explicit feedback, implicit feedback and sequential recommendation) and 3 NLP tasks (neural machine translation, BERT pre-training, and BERT fine-tuning for sentiment classification) and found that our proposed SSE can effectively improve generalization performances on a wide variety of tasks. Note that the details about datasets and parameter settings can be found in the appendix.

The Matrix Factorization Algorithm (MF) [73] and the Bayesian Personalized Ranking Algorithm (BPR) [91] can be viewed as neural networks with one hidden layer (latent features) and are quite popular in recommendation tasks. MF uses the squared loss designed for explicit feedback data while BPR uses the pairwise ranking loss designed for implicit feedback data.

First, we conduct experiments on two explicit feedback datasets: Movielens1m and Movielens10m. For these datasets, we can construct graphs based on the actors/actresses starring in the movies. We compare SSE-graph and the popular Graph Laplacian Regularization (GLR) method [89] in Table 5.1. The results show that SSE-graph consistently outperforms GLR. This indicates that our SSE-Graph has greater potential than graph Laplacian regularization, as we do not explicitly penalize the distances across embeddings, but rather implicitly penalize the effects of similar embeddings on the loss. Furthermore, we show that even without existing knowledge graphs of embeddings, our SSE-SE performs only slightly worse than SSE-Graph but still much better than GLR and MF.

In general, SSE-SE is a good alternative when graph information is not available. We then show that our proposed SSE-SE can be used together with standard regularization techniques such as dropout and weight decay to improve recommendation results regardless of the loss function and the dimensionality of the embeddings. This is evident in Table 5.2 and Table 5.3. With the help of SSE-SE, BPR can perform better than the state-of-the-art listwise approach SQL-Rank [116] in most cases. We include the optimal SSE parameters in Table 5.4. SSE-SE has two tuning parameters: the probability $p_x$ to replace embeddings associated with the input $x_i$ and the probability $p_y$ to replace embeddings associated with the output $y_i$. We use a dropout probability of 0.1, weight decay of 1e-5, and learning rate of 1e-3 for all experiments.

SASRec [56] is the state-of-the-art algorithm for the sequential recommendation task. It applies the transformer model [110], where a sequence of items purchased by a user can be viewed as a sentence in the transformer, and next item prediction is equivalent to next word prediction in the language model. In Table 5.4, we perform SSE-SE on input embeddings ($p_x = 0.1$, $p_y = 0$), output embeddings ($p_x = 0$, $p_y = 0.1$) and both embeddings ($p_x = p_y = 0.1$), and observe that all of them significantly improve over state-of-the-art SASRec ($p_x = p_y = 0$). The regularization effect of SSE-SE is even more obvious when we increase the number of self-attention blocks from 2 to 6, as this leads to a more sophisticated model with many more parameters, which overfits terribly even with dropout and weight decay. We can see in Table 5.4 that when both methods use dropout and weight decay, SSE-SE + SASRec performs much better than SASRec without SSE-SE.

Table 5.5. Our proposed SSE-SE helps the Transformer achieve better BLEU scores on English-to-German in 10 out of 11 newstest data sets between 2008 and 2018.

We use the transformer model [110] as the backbone for our experiments. The baseline model is the standard 6-layer transformer architecture and we apply SSE-SE to both the encoder and the decoder by replacing the corresponding vocabulary embeddings in the source and target sentences. We trained on the standard WMT 2014 English to German dataset, which consists of roughly 4.5 million parallel sentence pairs, and tested on the WMT 2008 to 2018 news-test sets. We use the OpenNMT implementation in our experiments. We use the same dropout rate of 0.1 and label smoothing value of 0.1 for the baseline model and our SSE-enhanced model. The only difference between the two models is whether or not we use our proposed SSE-SE with $p_0 = 0.01$ in (5.5) for both encoder and decoder embedding layers. We evaluate both models' performance on the test datasets using BLEU scores [85].

We summarize our results in Table 5 

BERT's model architecture [25] is a multi-layer bidirectional Transformer encoder based on the Transformer model in neural machine translation. Although SSE-SE can be used for both the pre-training and fine-tuning stages of BERT, we mainly focus on pre-training, as fine-tuning bears more similarity to the previous section. We use SSE probability of 0.015 for embeddings (one-hot encodings) associated with labels and SSE probability of

We continue to pre-train the Google pre-trained BERT model on our crawled IMDB movie reviews with and without SSE-SE and compare performance on downstream tasks.

In Table 5.6, we find that the SSE-SE pre-trained BERT base model helps us achieve state-of-the-art results for the IMDB sentiment classification task, which is better than the previous best in [47]. We report a test set accuracy of 0.9542 after fine-tuning for one epoch only. For the similar SST-2 sentiment classification task in Table 5

Figure 5.4 clearly shows that our one-hidden-layer neural networks with SSE-SE achieve much better generalization results than their respective standalone versions.

One can also easily spot that SSE-version algorithms converge at much faster speeds with the same learning rate.

We have proposed Stochastic Shared Embeddings, a data-driven approach to regularization that stands in contrast to brute-force regularization such as Laplacian and ridge regularization. Our theory is a first step towards explaining the regularization effect of SSE, particularly by 'smoothing' the Rademacher complexity. The extensive experimentation demonstrates that SSE can be fruitfully integrated into existing deep learning applications.

Chapter 6 SSE-PT: Sequential Recommendation Via Personalized Transformer

The sequential recommendation problem has been an important open research question, yet using temporal information to improve recommendation performance has proven to be challenging. SASRec, proposed by [56] for sequential recommendation problems, has achieved state-of-the-art results and enjoyed more than a 10x speed-up when compared to earlier CNN/RNN-based methods. However, the model used in SASRec is the standard Transformer, which is inherently an un-personalized model. In practice, it is important to include a personalized Transformer in SASRec, especially for recommender systems, but [56] found that adding personalized embeddings did not improve the performance of their Transformer model. They postulated that adding personalization fails because the model already uses the user history, so the user embeddings only contribute to overfitting. In this work, we propose a novel method, Personalized Transformer (SSE-PT), that successfully introduces personalization into self-attentive neural network architectures.

Introducing user embeddings into the standard transformer model is intrinsically difficult with existing regularization techniques, because it unavoidably introduces a large number of user parameters, often on the same scale as the number of training data points. But we show that personalization can greatly improve ranking performance with a recent regularization technique called Stochastic Shared Embeddings (SSE) [117]. The personalized Transformer (SSE-PT) model with SSE regularization works well for all 5 real-world datasets we consider without overfitting, outperforming the previous state-of-the-art algorithm SASRec by almost 5% in terms of NDCG@10. Furthermore, after examining some random users' engagement history, we find our model is not only more interpretable but also able to focus on recent engagement patterns for each user. Moreover, our SSE-PT model with a slight modification, which we call SSE-PT++, can handle extremely long sequences and outperform SASRec in ranking results with comparable training speed, striking a balance between performance and speed requirements.

Both session-based and sequential (i.e., next-basket) recommendation algorithms take advantage of additional temporal information to make better personalized recommendations. The main difference between session-based recommendations [43] and sequential recommendations [56] is that the former assumes that the user ids are not recorded, and therefore the engagement sequences are relatively short. As a result, session-based recommendations normally do not consider user factors. On the other hand, sequential recommendation treats each sequence as a user's engagement history [56]. Neither setting explicitly requires time-stamps: only the relative temporal orderings are assumed known (in contrast to, for example, timeSVD++ [60], which uses time-stamps). Initially, sequence data in temporal order were usually modelled with Markov models, in which a future observation is conditioned on the last few observed items [92]. In [92], a personalized Markov model with user latent factors is proposed for more personalized results.

In recent years, deep learning techniques, borrowed from natural language processing (NLP) literature, are getting widely used in tackling sequential data. Like word sentences in NLP, item sequences in recommendations can be similarly modelled by recurrent neural networks (RNN) [42, 43] and convolutional neural network (CNN) [108] models.

Recently, attention models are increasingly used in both NLP [25, 110] and recommender systems [56, 68] . SASRec [56] is a recent method with state-of-the-art performance among the many deep learning models. Motivated by the Transformer model in neural machine translation [110] , SASRec utilizes a similar architecture to the encoder part of the Transformer model. Our proposed model, SSE-PT, is a personalized extension of the transformer model.

In deep learning, models with many more parameters than data points can easily overfit to the training data. This may prevent us from adding user embeddings as additional parameters into complicated models like the Transformer model [56] , which can easily have 20 layers with millions of parameters for a medium-sized dataset like Movielens10M [38] .

$\ell_2$ regularization [46] is the most widely used approach and has been used in many matrix factorization models in recommender systems; $\ell_1$ regularization [109] is used when a sparse model is preferred. For deep neural networks, it has been shown that $\ell_p$ regularizations are often too weak, while dropout [45, 106] is more effective in practice. There are many other regularization techniques, including parameter sharing [30], max-norm regularization [104], gradient clipping [82], etc. Very recently, a new technique called Stochastic Shared Embeddings (SSE) [117] was proposed as a means of regularizing embedding layers. We find that the base version SSE-SE is essential to the success of our Personalized Transformer (SSE-PT) model.

Given $n$ users, each engaging with a subset of $m$ items in a temporal order, the goal of sequential recommendation is to learn a good personalized ranking of the top $K$ items out of the total $m$ items for any given user at any given time point. We assume data in the format of $n$ item sequences:

$$s_i = (j_{i1}, j_{i2}, \ldots, j_{iT}), \quad 1 \le i \le n. \qquad (6.1)$$

Each sequence $s_i$ of length $T$ contains the indices of the last $T$ items that user $i$ has interacted with, in temporal order (from old to new). For different users, the sequence lengths can vary, but we can pad the shorter sequences so that all of them have length $T$. We cannot simply split data points randomly into train/validation/test sets because they come in temporal order. Instead, we need to make sure that the training data temporally precede the validation data, which in turn precede the test data. We use the last items in the sequences as test sets, the second-to-last items as validation sets and the rest as training sets. We use ranking metrics such as NDCG@K and Recall@K for evaluation, which are defined in the Appendix.
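The split and padding described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the choice of left-padding, and the padding id 0 are assumptions made for the example and are not taken from the text.

```python
def leave_last_out_split(sequence):
    """Split one user's chronologically ordered item list.

    Returns (train, valid, test): the last item is held out for testing,
    the second-to-last for validation, and the rest is used for training.
    """
    if len(sequence) < 3:
        # Assumption: sequences too short to hold anything out stay in training.
        return list(sequence), None, None
    return list(sequence[:-2]), sequence[-2], sequence[-1]


def pad_or_truncate(sequence, max_len, pad_id=0):
    """Keep the most recent max_len items (max_len >= 1), left-padding short ones."""
    recent = list(sequence[-max_len:])
    return [pad_id] * (max_len - len(recent)) + recent


history = [5, 9, 2, 7, 3]                  # hypothetical item ids, oldest to newest
train, valid, test = leave_last_out_split(history)
padded = pad_or_truncate(train, max_len=4)
```

Because the held-out items are always the most recent ones, every training interaction precedes the validation item, which precedes the test item.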

Our model, which we call SSE-PT, is motivated by the Transformer model in [110] and [56] .

It also utilizes a new regularization technique called stochastic shared embeddings [117] .

In the following sections, we are going to examine each important component of our

where $d = d_u + d_i$. So each input sequence $s_i \in \mathbb{R}^T$ will be represented by the following embedding:

where $[v_{j_{it}}; u_i]$ represents concatenating the item embedding $v_{j_{it}} \in \mathbb{R}^{d_i}$ and the user embedding $u_i \in \mathbb{R}^{d_u}$ into the embedding $E_t \in \mathbb{R}^{d}$ for time $t$. Note that the main difference between our model and [56] is that we introduce the user embeddings $u_i$, making our model personalized.

Transformer Encoder On top of the embedding layer, we have $B$ blocks of self-attention layers and fully connected layers, where each layer extracts features for each time step based on the previous layer's outputs. Since this part is identical to the Transformer [56, 110], we will skip the details.

Prediction Layer At time $t$, the predicted probability of user $i$ engaging with item $l$ is:

$$p_{itl} = \sigma(r_{itl}), \qquad (6.3)$$

where $\sigma$ is the sigmoid function and $r_{itl}$ is the predicted score of item $l$ by user $i$ at time point $t$, defined as:

where $F^B_{t-1}$ is the output hidden units associated with the transformer encoder at the last timestamp. Although we can use another set of user and item embedding look-up tables for $u_i$ and $v_l$, we find it better to use the same set of embedding look-up tables $U$, $V$ as in the embedding layer. But the regularization for those embeddings can be different. To distinguish the $u_i$ and $v_l$ in (6.4) from the $u_i$, $v_j$ in (6.2), we call the embeddings in (6.4) output embeddings and those in (6.2) input embeddings.

The binary cross entropy loss between the predicted probability for the positive item $l = j_{i(t+1)}$ and one uniformly sampled negative item $k \in \Omega$ is given as $-[\log(p_{itl}) + \log(1 - p_{itk})]$.

Summing over $s_i$ and $t$, we obtain the objective function that we want to minimize:

At inference time, top-$K$ recommendations for user $i$ at time $t$ can be made by sorting the scores $r_{itl}$ for all items and recommending the first $K$ items in the sorted list.
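As a small illustration of the per-step loss and of top-$K$ inference, the following Python sketch takes the scores as given numbers. The helper names and the example scores are hypothetical; in the actual model the scores would come from the prediction layer.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_pair_loss(pos_score, neg_score):
    """-[log(p_pos) + log(1 - p_neg)] for one positive and one sampled negative."""
    p_pos = sigmoid(pos_score)
    p_neg = sigmoid(neg_score)
    return -(math.log(p_pos) + math.log(1.0 - p_neg))


def top_k(scores, k):
    """Return the indices of the k highest-scoring items, best first."""
    order = sorted(range(len(scores)), key=lambda item: scores[item], reverse=True)
    return order[:k]


scores = [0.2, 2.5, -1.0, 1.7]     # hypothetical scores for four items
loss = bce_pair_loss(pos_score=scores[1], neg_score=scores[2])
recommended = top_k(scores, k=2)
```

The loss is small when the positive item scores high and the sampled negative scores low; summing it over users and time steps gives the training objective.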

The most important regularization technique for the SSE-PT model is Stochastic Shared Embeddings (SSE) [117].

The main idea of SSE is to stochastically replace an embedding with another embedding with some pre-defined probability during SGD, which has the effect of regularizing the embedding layers. Without SSE, all the existing well-known regularization techniques like layer normalization, dropout and weight decay fail to prevent the model from over-fitting badly after user embeddings are introduced. [117] develops two versions of SSE, SSE-Graph and SSE-SE. In the simplest uniform case, SSE-SE replaces one embedding with another embedding chosen uniformly at random with probability $p$, which is called the SSE probability in [117]. Since we do not have knowledge graphs for users or items, we simply apply SSE-SE to our SSE-PT model. We find that SSE-SE makes it possible to train this personalized model with $O(n d_u)$ additional parameters.

There are 3 different places in our model where SSE-SE can be applied. We can apply SSE-SE to input/output user embeddings, input item embeddings, and output item embeddings with probabilities $p_u$, $p_i$ and $p_y$ respectively. Note that the input user embedding and the output user embedding are always replaced at the same time with SSE probability $p_u$.

Empirically, we find that applying SSE-SE to user embeddings and output item embeddings always helps, but applying SSE-SE to input item embeddings is only useful when the average sequence length is large, e.g., more than 100 as in the Movielens1M and Movielens10M datasets.
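The uniform replacement in SSE-SE can be sketched at the level of embedding indices, before the look-up. The sketch below is a simplified illustration under stated assumptions: the function names are hypothetical, the replacement index is drawn uniformly from all indices (so it may occasionally equal the original), and the user index is perturbed once per training example so that the input and output user embeddings change together.

```python
import random


def sse_se(indices, num_embeddings, prob, rng):
    """Replace each index by a uniformly drawn index with probability `prob`."""
    replaced = []
    for idx in indices:
        if rng.random() < prob:
            replaced.append(rng.randrange(num_embeddings))
        else:
            replaced.append(idx)
    return replaced


def perturb_example(user_id, input_items, target_item, n_users, n_items,
                    p_u, p_i, p_y, rng):
    """Apply SSE-SE to one training example (training time only)."""
    # One draw for the user index: it is shared by the input and output sides.
    user = sse_se([user_id], n_users, p_u, rng)[0]
    inputs = sse_se(input_items, n_items, p_i, rng)     # input item embeddings
    target = sse_se([target_item], n_items, p_y, rng)[0]  # output item embedding
    return user, inputs, target


rng = random.Random(0)
example = perturb_example(user_id=7, input_items=[3, 8, 1], target_item=4,
                          n_users=100, n_items=50,
                          p_u=0.9, p_i=0.0, p_y=0.1, rng=rng)
```

Setting a probability to 0 switches SSE-SE off for that embedding table, which matches the observation that perturbing input item embeddings is only worthwhile for long sequences.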

Other Regularization Techniques Besides SSE [117], we also utilize other widely used regularization techniques, including layer normalization [4], batch normalization [51], residual connections [39], weight decay [64], and dropout [106]. Since they are used in the same way as in the previous paper [56], we defer the details to the Appendix.

In this section, we compare our proposed algorithms, Personalized Transformer (SSE-PT) and SSE-PT++, with other state-of-the-art algorithms on real-world datasets. We implement our code in TensorFlow and conduct all our experiments on a server with a 40-core Intel Xeon E5-2630 v4 @ 2.20GHz CPU, 256G RAM and Nvidia GTX 1080 GPUs.

Datasets We use 5 datasets. The first 4 have exactly the same train/dev/test splits as in [56]. The datasets are: the Beauty and Games categories from the Amazon product review datasets; the Steam dataset introduced in [56], which contains reviews crawled from a large video game distribution platform; the Movielens1M dataset [38], a widely used benchmark dataset containing one million user movie ratings; and the Movielens10M dataset with ten million user ratings, cleaned by us. Detailed dataset statistics are given in Table 8.7. One can easily see that the first 3 datasets have short sequences (average length < 12) while the last 2 datasets have very long sequences (> 10x longer).

Evaluation Metrics The evaluation metrics we use are standard ranking metrics, namely NDCG and Recall for top recommendations (see Appendix). We follow the same evaluation setting as the previous paper [56]: predicting ratings at time point $t + 1$ given the previous $t$ ratings. For a large dataset with numerous users and items, the evaluation procedure would be slow because (8.14) would require computing the ranking of all items based on their predicted scores for every single user. To speed up evaluation, we sample a fixed number $C$ (e.g., 100) of negative candidates while always keeping the positive item that we know the user will engage with next. This way, both $R_{ij}$ and $\Pi_i$ are narrowed down to a small set of item candidates, and prediction scores are only computed for those items through a single forward pass of the neural network.

Ideally, we want both NDCG and Recall to be as close to 1 as possible, because NDCG@K = 1 means the positive item is always put on the top-1 position of the top-K ranking list, and Recall@K = 1 means the positive item is always contained by the top-K recommendations the model makes.
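The sampled evaluation can be sketched as follows. With a single positive item, the ideal DCG equals 1, so NDCG@K reduces to $1/\log_2(\text{rank} + 1)$ when the positive is ranked within the top $K$ and to 0 otherwise. The helper names below are hypothetical, and the sketch assumes negatives are drawn uniformly from all other items; whether previously seen items are excluded is not modelled here.

```python
import math
import random


def sample_candidates(positive, num_items, num_negatives, rng):
    """Return the positive item plus `num_negatives` distinct other items."""
    negatives = set()
    while len(negatives) < num_negatives:   # requires num_negatives < num_items
        item = rng.randrange(num_items)
        if item != positive:
            negatives.add(item)
    return [positive] + sorted(negatives)


def rank_of_positive(scores):
    """1-based rank of candidate 0 (the positive); ties favour the positive."""
    return 1 + sum(1 for s in scores[1:] if s > scores[0])


def ndcg_and_recall_at_k(rank, k):
    """Single-positive NDCG@K and Recall@K for one user."""
    if rank > k:
        return 0.0, 0.0
    return 1.0 / math.log2(rank + 1), 1.0


rng = random.Random(0)
candidates = sample_candidates(positive=3, num_items=50, num_negatives=10, rng=rng)
scores = [0.5, 0.9, 0.1, 0.7]               # hypothetical scores, positive first
ndcg, recall = ndcg_and_recall_at_k(rank_of_positive(scores), k=10)
```

Averaging these per-user values over all users gives the reported NDCG@K and Recall@K.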

Baselines We include 5 non-deep-learning and 6 deep-learning algorithms in our comparisons.

The simplest baseline is PopRec, which ranks items according to their popularity. More advanced methods such as matrix-factorization-based baselines include Bayesian personalized ranking for implicit feedback [91], namely BPR; Factorized Markov Chains and Personalized Factorized Markov Chains models [92], also known as FMC and PFMC; and a translation-based method [40] called TransRec.

Deep-learning Baselines Recent years have seen many advances in deep learning for sequential recommendation. GRU4Rec is the first RNN-based method proposed for this problem [43]; GRU4Rec+ [42] was later proposed to address some shortcomings of the initial version. Caser is the corresponding CNN-based method [108]. STAMP [68] utilizes the attention mechanism without using RNN or CNN as building blocks. Very recently, SASRec utilized the state-of-the-art Transformer encoder [110] with self-attention mechanisms. Hierarchical gating networks, also known as HGN [69], have also been proposed to solve this problem.

We use the same datasets as in [56] and follow the same procedure as in that paper: we use the last item for each user as test data, the second-to-last as validation data and the rest as training data. We implemented our method in TensorFlow and solve it with the Adam optimizer [57] with a learning rate of 0.001, momentum exponential decay rates $\beta_1 = 0.9$, $\beta_2 = 0.98$ and a batch size of 128. In Table 6.1, since we use the same data, the performance of previous methods except STAMP has been reported in [56]. We tune the dropout rate and the SSE probabilities $p_u$, $p_i$, $p_y$ for input user/item embeddings and

Apart from evaluating our SSE-PT against SASRec using well-defined ranking metrics on real-world datasets, we also visualize the differences between the two methods in terms of their attention mechanisms. In Figure 6.2, a random user's engagement history in the Movielens1M dataset is given in temporal order (column-wise). We hide the last item, whose index is 26, in the test set and hope that a temporal collaborative ranking model can figure out that item 26 is the one this user will watch next using only the previous engagement history. One can see that a typical user tends to look at different styles of movies at different times. Earlier on, they watched a variety of movies, including Sci-Fi, animation, thriller, romance, horror, action, comedy and adventure. But later on, in the last two columns of Figure 6.2, drama and thriller are the two types they like to watch most, especially drama. In fact, 9 of the 10 most recent movies they watched are dramas.

For humans, it is natural to reason that the hidden movie should probably also be drama 

In [56], it has been shown that SASRec is about 11 times faster than Caser and 17 times faster than GRU4Rec+ and achieves much better NDCG@10 results so we did for Movielens1M dataset. Given that we added user embeddings to our SSE-PT model, it is expected that it will take slightly longer to train our model than the un-personalized SASRec. We find empirically that the training speeds of the SSE-PT and SSE-PT++ models are comparable to that of SASRec, with SSE-PT++ being the fastest and the best performing model. It is clear that our SSE-PT and SSE-PT++ achieve much better ranking performance than our baseline SASRec using the same training time.

SSE probability Given the importance of SSE regularization for our SSE-PT model, we carefully examined the SSE probability for the input user embedding in Table 8.10 in the Appendix. We find that the model is not very sensitive to this hyper-parameter: any SSE probability between 0.4 and 1.0 gives good results, better than both parameter sharing and not using SSE-SE. This is also evident from the comparison results in Table 6.3.

Recall that the sampling probability is unique to our SSE-PT++ model. We show in Table 8.11 in the Appendix that using an appropriate sampling probability, such as 0.2 to 0.3, allows it to outperform SSE-PT when the same maximum length is used.

We find for our SSE-PT model, a larger number of attention blocks is preferred. One can easily see in Table 8 

In this chapter, we propose a novel neural network architecture called Personalized Transformer for the temporal collaborative ranking problem. It enjoys the benefits of being a personalized model, therefore achieving better ranking results for individual users than the current state-of-the-art. By examining the attention mechanisms during inference, we find that the model is also more interpretable and tends to pay more attention to recent items in long sequences than un-personalized deep learning models.

I was exploring directions on incorporating additional side information such as graphs and temporal orderings. In chapter 2, we came up with a new graph encoding method to enhance existing graph-based collaborative filtering algorithms, allowing them to encode deep graph information and therefore achieve better recommendation performance. We made the temporal collaborative ranking model personalized in chapter 6 by incorporating user embeddings. In the process, motivated by the need to prevent over-fitting caused by the additional parameters, we introduced a general regularization technique for embedding layers in deep learning in chapter 5, which was shown to be useful for many other models with lots of embedding parameters both within and outside recommendation.

Despite that the dissertation is lengthy and has contained many important research [6]. But to do that, we need a well-defined problem, dataset and metric, and lots of people participating from both academia and industry, combining strengths from both parties.

total $nm$ ratings. We choose $T = 3$ so the graph contains at most 6-hop information among the $n$ users. We use rank $r = 50$ for both user and item embeddings. We set the influence weight $w = 0.6$, i.e., in each propagation step, 60% of one user's preference is decided by their friends (i.e., neighbors in the friendship graph). We set $p = 0.001$, which is the probability of each possible edge being chosen in the Erdős-Rényi graph $G$. A small edge probability $p$, an influence weight $w < 1.0$, and a not-too-large $T$ are needed, because we do not want all users to become more or less the same after $T$ propagation steps.

We omit the definitions of RMSE, Precision@k, NDCG@k, MAP as those can be easily found online. HLU: Half-Life Utility [11, 98] is defined as:

where n is the number of users and HLU i is given by:

where $R_{i\Pi_{il}}$ follows the previous definition, $d$ is the neutral vote (usually the rating average), and $\alpha$ is the viewing half-life. The half-life is the number of the item on the list such that there is a 50-50 chance the user will review that item [11].

Input: $n$ users, $m$ items, rank $r$, influence weight $w$, $T$ propagation steps
Output: $R_{tr} \in \mathbb{R}^{n \times m}$, $R_{te} \in \mathbb{R}^{n \times m}$, $G \in \mathbb{R}^{n \times n}$
1: Randomly initialize $U \in \mathbb{R}^{n \times r}$, $V \in \mathbb{R}^{m \times r}$ from the standard normal distribution
for $i = 1, \ldots, n$ do

Set $U = \tilde{U}$
7: Generate rating matrix $R = U V^T$
8: Randomly sample observed user/item indices in training and test data: $\Omega_{tr}$, $\Omega_{te}$
9: Obtain $R_{tr} = \Omega_{tr} \circ R$, $R_{te} = \Omega_{te} \circ R$
10: return rating matrices $R_{tr}$, $R_{te}$, user graph $G$
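A small pure-Python sketch of this kind of simulation is given below. It is not the dissertation's code: the per-user update is an assumption (each user keeps a fraction $1 - w$ of their own factors and takes a fraction $w$ from the average of their friends' factors), and the function names and the tiny problem sizes are chosen only for illustration.

```python
import random


def generate_synthetic(n, m, r, w, T, p, rng):
    """Return a dense rating matrix R (n x m) and the friendship adjacency lists."""
    # Erdos-Renyi graph: every undirected edge is present with probability p.
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                neighbors[i].append(j)
                neighbors[j].append(i)

    U = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n)]
    V = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(m)]

    for _ in range(T):
        U_new = []
        for i in range(n):
            if not neighbors[i]:
                U_new.append(list(U[i]))    # isolated users keep their factors
                continue
            mean = [sum(U[j][c] for j in neighbors[i]) / len(neighbors[i])
                    for c in range(r)]
            # Assumed update: weight w on friends, 1 - w on the user's own factors.
            U_new.append([(1.0 - w) * U[i][c] + w * mean[c] for c in range(r)])
        U = U_new                           # corresponds to "Set U = U-tilde"

    R = [[sum(U[i][c] * V[j][c] for c in range(r)) for j in range(m)]
         for i in range(n)]
    return R, neighbors


def split_observed(n, m, train_frac, test_frac, rng):
    """Sample disjoint sets of observed (user, item) cells for train and test."""
    cells = [(i, j) for i in range(n) for j in range(m)]
    rng.shuffle(cells)
    n_train = int(train_frac * len(cells))
    n_test = int(test_frac * len(cells))
    return cells[:n_train], cells[n_train:n_train + n_test]


rng = random.Random(0)
R, graph = generate_synthetic(n=20, m=10, r=3, w=0.6, T=3, p=0.2, rng=rng)
omega_tr, omega_te = split_observed(20, 10, 0.1, 0.05, rng)
```

With $w$ close to 1 or a large $T$, the rows of $U$ within a connected component drift towards a common value, which is why a moderate $w$ and a small $T$ are preferred.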

To reproduce the results reported in the paper, one needs to download the data (douban and flixster) and the third-party C++ matrix factorization library from https://www.csie.ntu.edu.tw/~cjlin/papers/ocmf-side/. One can simply follow the README there to compile the code in Matlab and run the one-class matrix factorization library in different modes (both explicit feedback and implicit feedback work). The advantage of using this library is that the code supports multi-threading and runs quite fast with very efficient memory allocation. It also supports graph or other side information. All three methods' baseline can be simply run with the tuning parameters we reported in the

As to the simulation study, we will also provide Python code to repeat our Algorithm 11 to generate the synthesis dataset. One can easily simulate the data before converting it into Matlab data format and running the code as before. The optimal parameters can be found in Table 8

In Table 8.4, one can compare the magnitudes of the optimal $\alpha$ and $\beta$ to get a good idea of whether $G$ or $G^2$ is more useful. $G$ represents shallow graph information and $G^2$ represents deep graph information. If one has already run GRMF $G^2$, one can use this as a preliminary test to decide whether to go deep with DNA-3 ($d = 3$) to capture deep graph information or simply go ahead with DNA-1 ($d = 1$) to fully utilize shallow information.

For the douban dataset, we have $\alpha = 0.05 > 0.0005 = \beta$, which implies that shallow information is important and we should fully utilize it. It explains why DNA-1 performs well in terms of both performance and speed on the douban dataset. It is worth noting that GRMF DNA-1's Bloom filter matrix $B$ contains many more non-zeros (nnz) than that of $G$ in Table 8

The synthesis dataset has 10,000 users and 2,000 items with a user friendship graph of size 10,000 × 10,000. Note that the graph only contains at most 6-hop valid information. GRMF $G^6$ means GRMF with $G + \alpha \cdot G^2 + \beta \cdot G^3 + \gamma \cdot G^4 + \cdot G^5 + \omega \cdot G^6$. GRMF DNA-d means depth $d$ is used.

Figure 5.1 as a custom operator.

To run SSE-Graph, we need to construct good-quality knowledge graphs on embeddings.

We managed to match movies in the Movielens1m and Movielens10m datasets to IMDB websites, so we can extract plentiful information for each movie, such as the cast of the movie, user reviews and so on. For simplicity, we construct the knowledge graph on item-side embeddings using the cast of movies. Two items are connected by an edge when they share one or more actors/actresses. For the user side, we do not have good-quality graphs: we are only able to create a graph on users in the Movielens1m dataset based on their age groups, but we do not have any side information on users in the Movielens10m dataset.

Algorithm 12. Compute gradient for $V$ when $U$ fixed
Input: $\Pi$, $U$, $V$, $\lambda$, $\rho$
Output: $g$ ($g \in \mathbb{R}^{r \times m}$ is the gradient for $f(V)$)
For implicit feedback, it should be $(1 + \rho) \cdot m$ instead of $m$, since $\rho \cdot m$ 0's are appended to the back
Initialize total = 0, tt = 0
Return $g$

When running experiments, we do a parameter sweep for the weight decay parameter and then fix it before tuning the parameters for SSE-Graph and SSE-SE. We utilize different $\rho$ and $p$ for the user and item embedding tables respectively. The optimal parameters are stated in

In the second leg of experiments, we remove the constraints on the maximum number of ratings per user. We want to show that SSE-SE can be a good alternative when graph information is not available. We follow the same procedures as in [115, 116]. In Table 5.2, we can see that SSE-SE can be used with dropout to achieve the smallest RMSE across the Douban, Movielens10m, and Netflix datasets. In Table 5.3, one can see that SSE-SE is more effective than dropout in this case and can perform better than the state-of-the-art listwise approach SQL-Rank [116] on 2 datasets out of 3.

In Table 5.2, SSE-SE has two tuning parameters because there are two embedding tables: the probability $p_u$ of replacing user-side embeddings and the probability $p_i$ of replacing item-side embeddings. But here, for simplicity, we use one tuning parameter $p_s = p_u = p_i$. We use a dropout probability of $p_d$, user/item embedding dimension $d$, weight decay of $\lambda$ and a learning rate of 0.01 for all experiments, with the exception that the learning rate is reduced to 0.005 when both SSE-SE and dropout are applied. For the Douban dataset, we use $d = 10$, $\lambda = 0.08$. For the Movielens10m and Netflix datasets, we use $d = 50$, $\lambda = 0.1$.

We use the transformer model [110] as the backbone for our experiments. The control group is the standard transformer encoder-decoder architecture with self-attention. In the experiment group, we apply SSE-SE to both the encoder and the decoder by replacing the corresponding vocabularies' embeddings in the source and target sentences. We trained on the standard WMT 2014 English-to-German dataset, which consists of roughly 4.5 million parallel sentence pairs, and tested on the WMT 2008 to 2018 news-test sets. Sentences were encoded into 32,000 tokens using byte-pair encoding. We use the SentencePiece, OpenNMT and SacreBLEU implementations in our experiments. We trained the 6-layer

In the first leg of experiments, we crawled one million user reviews from IMDB and pre-trained the BERT-Base model (12 blocks) for 500,000 steps using sequences of maximum length 512, a batch size of 8 and a learning rate of $2 \times 10^{-5}$ for both models on one NVIDIA V100 GPU. Then we pre-trained on a mixture of our crawled reviews and the reviews in the IMDB sentiment classification task (250K reviews in train and 250K reviews in test) for another 200,000 steps before training for another 100,000 steps on the reviews in the IMDB sentiment classification task only. In total, both models are pre-trained on the same datasets for 800,000 steps, with the only difference being that our model utilizes SSE-SE.

In the second leg of experiments, we fine-tuned the two models obtained in the first-leg experiments on two sentiment classification tasks: IMDB sentiment classification task and SST-2 sentiment classification task. The goal of pre-training on IMDB dataset but fine-tuning for SST-2 task is to explore whether SSE-SE can play a role in transfer learning.

The results are summarized in Table 5.6 for the IMDB sentiment task. In the experiments, we use a maximum sequence length of 512, a learning rate of $2 \times 10^{-5}$ and a dropout probability of 0.1, and we run fine-tuning for 1 epoch for the two pre-trained models we obtained before. For the Google pre-trained BERT-base model, we find that we need to run a minimum of 2 epochs. This shows that pre-training can speed up fine-tuning. We find that the Google pre-trained model performs worst in accuracy because it was only pre-trained on Wikipedia and a books corpus, while ours have seen many additional user reviews. We also find that the SSE-SE pre-trained model can achieve an accuracy of 0.9542 after fine-tuning for one epoch only. In contrast, the accuracy is only 0.9518 without SSE-SE for embeddings associated with output $y_i$.

For the SST-2 task, we use a maximum sequence length of 128, a learning rate of $2 \times 10^{-5}$ and a dropout probability of 0.1, and we run fine-tuning for 3 epochs for all 3 models in Table 5.7. We report AUC, accuracy and F1 score on the dev data. For test results, we submitted our predictions to the GLUE website for the official evaluation. We find that even in transfer learning, our SSE-SE pre-trained model still enjoys advantages over the Google pre-trained model and our pre-trained model without SSE-SE. Our SSE-SE pre-trained model achieves 94.3% accuracy on the SST-2 test set versus 93.6% and 93.8% respectively. If we use SSE-SE for both pre-training and fine-tuning, we can achieve 94.5% accuracy on the SST-2 test set, which approaches the 94.9 score reported by the BERT-Large model. An SSE probability of 0.01 is used for fine-tuning.

Throughout this section, we will suppress the probability parameters, $p(\cdot, \cdot \mid \Phi) = p(\cdot, \cdot)$.

Simulation of a bound on $\rho_{L,n}$ for the movielens1M dataset. Throughout the simulation, $L$ is replaced with (which will bound $\rho_{L,n}$ by Jensen's inequality). The SSE probability parameter dictates the probability of transitioning. When this is 0 (box plot on the right), the distribution is that of the samples from the standard Rademacher complexity (without the sup and expectation). As we increase the transition probability, the values for $\rho_{L,n}$ get smaller.

Let us break the variability term into two components

Notice that we may write,

Again we may introduce a second set of Rademacher random variables $\sigma_i$, which results in

Then the result follows from McDiarmid's inequality.

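• NDCG@K: defined as:

$$\mathrm{NDCG@}K = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathrm{DCG@}K(i, \Pi_i)}{\mathrm{DCG@}K(i, \Pi_i^*)}, \qquad (8.14)$$

where $i$ represents the $i$-th user and

$$\mathrm{DCG@}K(i, \Pi_i) = \sum_{l=1}^{K} \frac{2^{R_{i\Pi_{il}}} - 1}{\log_2(l + 1)}. \qquad (8.15)$$

In the DCG definition, $\Pi_{il}$ represents the index of the $l$-th ranked item for user $i$ in the test data based on the learned score matrix $X$. $R$ is the rating matrix and $R_{ij}$ is the rating given to item $j$ by user $i$. $\Pi_i^*$ is the ordering provided by the ground truth rating.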

• Recall@K: defined as a fraction of positive items retrieved by the top K recommendations the model makes:

Here we assume there is only a single positive item that the user will engage with next, and the indicator function $\mathbb{1}\{\exists\, 1 \le l \le K : R_{i\Pi_{il}} = 1\}$ indicates whether the positive item falls into the top $K$ positions of the ranked list obtained using the scores predicted in (6.4).
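For graded ratings, the DCG in (8.15) and its normalization by the ideal ordering can be computed for one user as in the Python sketch below. The function names are hypothetical, and the sketch assumes the per-user values are averaged over users afterwards to obtain the overall metric.

```python
import math


def dcg_at_k(ratings_in_ranked_order, k):
    """DCG@K = sum over positions l = 1..K of (2^R - 1) / log2(l + 1)."""
    return sum((2.0 ** rating - 1.0) / math.log2(position + 1)
               for position, rating in enumerate(ratings_in_ranked_order[:k], start=1))


def ndcg_at_k(ratings, scores, k):
    """NDCG@K for one user: DCG of the predicted order over DCG of the ideal order."""
    ranked = sorted(zip(scores, ratings), key=lambda pair: pair[0], reverse=True)
    predicted = [rating for _, rating in ranked]
    ideal = sorted(ratings, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(predicted, k) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg_at_k(ratings=[3, 2, 1], scores=[0.9, 0.5, 0.1], k=3)
second_place = ndcg_at_k(ratings=[0, 1, 0], scores=[0.9, 0.8, 0.1], k=3)
```

In the second call the only relevant item is ranked second, so the value is $1/\log_2 3$, the single-positive special case used in the sampled evaluation.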

Layer Normalization Layer normalization [4] normalizes neurons within a layer. Previous studies [4] show it is more effective than batch normalization for training recurrent neural networks (RNNs). One alternative is batch normalization [51], but we find it does not work as well as layer normalization in practice, even for a reasonably large batch size of 128. Therefore, our SSE-PT model adopts layer normalization.

Residual Connections Residual connections were first proposed in ResNet for image classification problems [39]. Recent research finds that residual connections can help in training very deep neural networks even if they are not convolutional neural networks [110].

Using residual connections allows us to train very deep neural networks here. For example, the best performing model for the Movielens10M dataset in Table 8.12 is the SSE-PT with 6 attention blocks, in which $1 + 6 \times 3 + 1 = 20$ layers are trained end-to-end.

Weight Decay Weight decay [64], also known as $\ell_2$ regularization [46], is applied to all embeddings, including both user and item embeddings.
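For plain gradient descent, adding the penalty $(\lambda/2)\lVert w \rVert^2$ to the loss contributes $\lambda w$ to the gradient, which shrinks each weight toward zero at every step. The sketch below shows one such update; the function name `sgd_step` and its parameters are assumptions for illustration, and the optimizer actually used for SSE-PT may differ.

```python
def sgd_step(weights, grads, lr, weight_decay):
    """One gradient step on loss + (weight_decay / 2) * ||w||^2."""
    # The l2 penalty adds weight_decay * w to the gradient of each weight.
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]
```

With a zero loss gradient, each step multiplies every weight by `1 - lr * weight_decay`, which is where the name weight decay comes from.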

Dropout Dropout [106] is applied to the embedding layer E, the self-attention layer and the pointwise feed-forward layer by stochastically dropping some percentage of hidden units to prevent co-adaptation of neurons. Dropout has been shown to be an effective way of regularizing deep learning models.
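The sketch below illustrates stochastic dropping of hidden units. It uses the common "inverted dropout" variant, in which kept units are rescaled by 1 / (1 - rate) during training so that no rescaling is needed at test time; this variant, the function name `dropout`, and its parameters are assumptions for the example rather than details stated in the text.

```python
import random


def dropout(values, rate, training=True, rng=None):
    """Zero each unit with probability `rate` during training; rescale the kept units."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)  # dropout is disabled at evaluation time
    rng = rng or random.Random()
    keep = 1.0 - rate
    # A unit is kept with probability `keep` and scaled so its expected value is unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

Passing a seeded `random.Random` instance as `rng` makes the dropped units reproducible across runs.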

In summary, layer normalization and dropout are used in all layers except the prediction layer. Residual connections are used in both the self-attention layer and the pointwise feed-forward layer. SSE-SE is used in the embedding layer and the prediction layer.

• PopRec: ranking items according to their popularity.

• BPR: Bayesian personalized ranking for the implicit feedback setting [91]. It is a low-rank matrix factorization model with a pairwise loss function. It does not utilize temporal information and therefore serves as a strong baseline among non-temporal methods.

• FMC: Factorized Markov Chains, a first-order Markov chain method in which predictions are based only on the previously engaged item.

• PFMC: a personalized Markov chain model [92] that combines matrix factorization and first-order Markov Chain to take advantage of both users' latent long-term preferences as well as short-term item transitions.

• TransRec: a first-order sequential recommendation method [40] in which items are embedded into a transition space and users are modelled as translation vectors.

SQL-Rank [116] and item-based recommendations [94] are omitted because the former is similar to BPR [91] except that it uses a listwise loss function instead of a pairwise one, and the latter has been shown to be inferior to TransRec [40].

• GRU4Rec: the first RNN-based method proposed for the session-based recommendation problem [43]. It utilizes the GRU structures [20] initially proposed for speech modelling.

• GRU4Rec+: follow-up work to GRU4Rec by the same authors. The model has a very similar architecture to GRU4Rec but a more complicated loss function [42].

• Caser: a CNN-based method [108] which embeds a sequence of recent items in both time and latent spaces, forming an 'image' before learning local features through horizontal and vertical convolutional filters. In [108], user embeddings are included in the prediction layer only. In contrast, in our Personalized Transformer, user embeddings are also introduced in the lowest embedding layer so they can play an important role in the self-attention mechanism as well as in the prediction stage.

• STAMP: a session-based recommendation algorithm [68] using an attention mechanism. [68] uses only fully connected layers with one attention block that is not self-attentive.

• SASRec: a self-attentive sequential recommendation method [56] motivated by the Transformer in NLP [110]. Unlike our method SSE-PT, SASRec does not incorporate user embeddings and therefore is not a personalized method. The SASRec paper [56] also does not utilize SSE [117] for further regularization; only dropout and weight decay are used.

• HGN: a hierarchical gating network method for the sequential recommendation problem [69], which incorporates user embeddings and gating networks for better personalization than the SASRec model.
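The simplest baseline in the list above, PopRec, can be sketched directly. The snippet counts how often each item appears in the training sequences and recommends the most popular ones; the function names are hypothetical, and filtering out items the user has already seen is an assumption of this sketch rather than something stated in the text.

```python
from collections import Counter


def poprec_ranking(sequences):
    """Rank all items by how often they occur in the training sequences."""
    counts = Counter(item for seq in sequences for item in seq)
    return [item for item, _ in counts.most_common()]


def recommend(ranking, seen, k):
    """Top-k popular items, skipping those the user has already interacted with."""
    return [item for item in ranking if item not in seen][:k]
```

Because the ranking is identical for every user up to the seen-item filter, PopRec uses neither personalization nor temporal information, which is what makes it a useful lower bound for the other methods.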


A recommender system based on local random walks and spectral methods

Ranking on graph data

Scalable bloom filters

Collaborative ranking

The netflix prize

The netflix prize

Graph convolutional matrix completion

Space/time trade-offs in hash coding with allowable errors

Apache hadoop goes realtime at facebook

Empirical analysis of predictive algorithms for collaborative filtering

Network applications of bloom filters: A survey

Graph regularized nonnegative matrix factorization for data representation

Grarep: Learning graph representations with global structural information

Learning to rank: from pairwise approach to listwise approach

Bigtable: A distributed storage system for structured data

Efficient algorithms for ranking with svms

Stochastic training of graph convolutional networks with variance reduction

Matrix completion with noisy side information

Empirical evaluation of gated recurrent neural networks on sequence modeling

Robust bloom filters for large multilabel classification tasks

Binaryconnect: Training deep neural networks with binary weights during propagations

Convolutional neural networks on graphs with fast localized spectral filtering

Item-based top-n recommendation algorithms

Pretraining of deep bidirectional transformers for language understanding

Adaptive subgradient methods for online learning and stochastic optimization

A new data structure for cumulative frequency tables

An efficient boosting algorithm for combining preferences

Large-scale matrix factorization with distributed stochastic gradient descent

Deep learning

Itemrank: A random-walk based scoring algorithm for recommender engines

node2vec: Scalable feature learning for networks

Preference completion from partial rankings

Matchin: eliciting user preferences with an online game

Inductive representation learning on large graphs

Representation learning on graphs: Methods and applications

Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding

The movielens datasets: History and context

Deep residual learning for image recognition

Translation-based recommendation

Neural collaborative filtering

Recurrent neural networks with topk gains for session-based recommendations

Session-based recommendations with recurrent neural networks

Recommending and evaluating choices in a virtual community of use

Improving neural networks by preventing co-adaptation of feature detectors

Ridge regression: Biased estimation for nonorthogonal problems

Universal language model fine-tuning for text classification

Pu learning for matrix completion

Collaborative filtering for implicit feedback datasets

Listwise collaborative filtering

Batch normalization: Accelerating deep network training by reducing internal covariate shift

Trustwalker: a random walk model for combining trust-based and item-based recommendation

Optimizing search engines using clickthrough data

Optimizing search engines using clickthrough data

Fism: factored item similarity models for top-n recommender systems

Self-attentive sequential recommendation

Adam: A method for stochastic optimization

Semi-supervised classification with graph convolutional networks

Factorization meets the neighborhood: a multifaceted collaborative filtering model

Collaborative filtering with temporal dynamics

Matrix factorization techniques for recommender systems

Retargeted matrix factorization for collaborative filtering

Imagenet classification with deep convolutional neural networks

A simple weight decay can improve generalization

Generalization analysis of listwise learning-to-rank algorithms

Dbpedia-a large-scale, multilingual knowledge base extracted from wikipedia

Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence

Stamp: short-term attention/memory priority model for session-based recommendation

Hierarchical gating networks for sequential recommendation

Recommender systems with social regularization

Distributed representations of words and phrases and their compositionality

Wordnet: a lexical database for english

Probabilistic matrix factorization

Geometric matrix completion with recurrent multi-graph neural networks

Simplifying neural networks by soft weight-sharing

The pagerank citation ranking: Bringing order to the web

An efficient algorithm for learning to rank from preference graphs

One-class collaborative filtering

Gbpr: Group preference based bayesian personalized ranking for one-class collaborative filtering

Transfer learning for behavior ranking

Preference completion: Large-scale collaborative ranking from pairwise comparisons

On the difficulty of training recurrent neural networks

Glove: Global vectors for word representation

Deepwalk: Online learning of social representations

A call for clarity in reporting BLEU scores

An item/user representation for recommender systems based on bloom filters

Improving language understanding by generative pre-training

Language models are unsupervised multitask learners

Collaborative filtering with graph information: Consistency and scalable methods

Factorization machines

Bpr: Bayesian personalized ranking from implicit feedback

Factorizing personalized markov chains for next-basket recommendation

Restricted boltzmann machines for collaborative filtering

Item-based collaborative filtering recommendation algorithms

Collaborative filtering recommender systems

Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks

Gossip algorithms

Mining recommendations from the web

Hash kernels for structured data

List-wise learning to rank with matrix factorization for collaborative filtering

User based collaborative filtering using bloom filter with mapreduce

Goal-directed inductive matrix completion

Relational learning via collective matrix factorization

Maximum-margin matrix factorization

Maximum-margin matrix factorization

Dropout: a simple way to prevent neural networks from overfitting

Rethinking the inception architecture for computer vision

Personalized top-n sequential recommendation via convolutional sequence embedding

Regression shrinkage and selection via the lasso

Attention is all you need

Collaborative ranking with 17 parameters

Irgan: A minimax game for unifying generative and discriminative information retrieval models

Cofi rank-maximum margin matrix factorization for collaborative ranking

Maximum margin matrix factorization for collaborative ranking

Large-scale collaborative ranking in near-linear time

Sql-rank: A listwise approach to collaborative ranking

Stochastic shared embeddings: Data-driven regularization of embedding layers

Temporal collaborative ranking via personalized transformer

Graph dna: Deep neighborhood aware graph encoding for collaborative filtering

Collaborative denoising auto-encoders for top-n recommender systems

Listwise approach to learning to rank: theory and algorithm

Edge-weighted personalized pagerank: breaking a decade-old performance barrier

Graph convolutional neural networks for web-scale recommender systems

Selection of negative samples for one-class matrix factorization

A scalable asynchronous distributed algorithm for topic modeling

A unified algorithm for one-class structured matrix factorization with side information

Social computing data repository at ASU

Kernelized probabilistic matrix factorization: Exploiting graphs and side information

In the simulation we carried out, we set the number of users n = 10,000 and the number of items m = 2,000. We uniformly sample 5% for training and 2% for testing out of the