Mapping Preferences into Euclidean Space

Oscar Luaces^a, Jorge Díez^a,*, Thorsten Joachims^b, Antonio Bahamonde^a,b

^a Universidad de Oviedo, Artificial Intelligence Center, Gijón, Asturias, Spain
^b Cornell University, Department of Computer Science, Ithaca, NY, USA

* Corresponding author: Tel: +34 985 182 588
Email addresses: oluaces@uniovi.es (Oscar Luaces), jdiez@uniovi.es (Jorge Díez), tj@cs.cornell.edu (Thorsten Joachims), abahamonde@uniovi.es (Antonio Bahamonde)

Preprint submitted to Expert Systems with Applications, July 15, 2015

Abstract

Understanding and modeling human preferences is one of the key problems in applications ranging from marketing to automated recommendation. In this paper, we focus on learning and analyzing the preferences of consumers regarding food products. In particular, we explore machine learning methods that embed consumers and products in a Euclidean space such that their relationship to each other models consumer preferences. In addition to predicting preferences that were not explicitly stated, the Euclidean embedding enables visualization and clustering to understand the overall structure of a population of consumers and their preferences regarding the set of products. Notice that consumer clusters are market segments, and product clusters can be seen as groups of similar items with respect to consumer tastes. We explore two types of Euclidean embedding of preferences, one based on inner products and one based on distances. Using a real-world dataset about consumers of beef meat, we find that both embeddings produce more accurate models than a tensorial approach that uses an SVM to learn preferences. The reason is that the number of parameters to be learned in the embeddings can be considerably lower than in the tensorial approach. Furthermore, we demonstrate that the visualization of the learned embeddings provides interesting insights into the structure of the consumer and product space, and that it provides a method for qualitatively explaining consumer preferences. Additionally, it is important to emphasize that the approach presented here is flexible enough to be used with different levels of knowledge about consumers or products; therefore, it can be applied in a wide range of settings to gain an accurate understanding of consumers' preferences.

Keywords: Preference Learning, matrix factorization, graphical representations, learning to order, visualization

1. Introduction

In 1927, Thurstone (1927) presented a law of comparative judgment to approach qualitative comparisons from a psychological point of view. According to this law, users react differently to each item and identify its degree of compatibility with the quality to be compared. The difference between these degrees defines the discriminal process between pairs of items.

From a Machine Learning perspective, there are two main ways to approach preferences. They can be represented by a real-valued function, which assigns a utility value to each object, or by a preference relation, which compares two different items; see (Hüllermeier and Fürnkranz, 2013). In the first approach, the degrees of compatibility (usually called utilities) are considered as the target output that can be learned by means of ordinal regression or classification methods. This is a suitable approach when it is possible to assume that users assign those utilities depending exclusively on the item being assessed.
However, in some cases there is a batch effect; that is, the assessment of an item depends on the batch of items included in the same comparison. When this is the case, it is more suitable to use the second approach and consider preferences as a binary relation. The goal here is to learn the relative ordering of items given by the user instead of the utility itself. This is the approach used by Herbrich et al. (1999), Joachims (2002), Bahamonde et al. (2007) and Rendle et al. (2009).

In this paper we are concerned with learning preferences expressed by consumers of a certain type of product. Consumers typically assign utilities only as a way to express relative preferences rather than absolute values. Therefore, the datasets that we are going to use are collections of pairwise comparisons, called preference judgments, that represent the discriminal process of one consumer between two products. In addition to modeling the products, we also explicitly model individual consumers. To represent the interaction of consumers and products we propose several factorization approaches and a tensorial approach.

A desirable property of factorization approaches is that they entail an embedding of both consumers and products in a common Euclidean space where the utility can be expressed in geometric (or graphical) terms. The consequence is that, as a side effect, learning preferences with these approaches provides a setting for visualizing clusters of both consumers and products.

In the following sections we introduce a common framework to learn factorizations and SVM tensorial models. The purpose is to discuss the characteristics of these approaches according only to their mathematical formulations. In all cases, the objects involved in preferences (consumers and products) can be represented by a binary identification code, by vectors of feature values, or by a combination of both. Notice that, for instance, with food products the features of consumers or of the products are not always available. Moreover, if a food company is planning to launch a new product, there is a reduced set of options that it wants to test, and these can simply be represented by an identification code. On the other hand, when there is a selected panel of specific consumers, they can be unequivocally identified with a label.

After the formal presentation of the methods, we show the results of an exhaustive set of experiments using a real-world dataset of consumers of beef meat. First we compare the results of factorization methods with those achieved by the SVM. In the datasets used in this paper, factorization methods outperform the SVM, and this is probably a general phenomenon. One reason is that the number of parameters to be learned is smaller in the factorization approaches. Additionally, the formal models learned in all cases are quite similar, and they all capture the possible interactions between the features of consumers and those of products.

The contributions of the paper are the following: (i) it presents a common framework for different approaches to learn preferences using matrix factorization, (ii) the paper illustrates the formal presentation with a real-world problem of consumers and (food) products, (iii) the last section shows a favorable comparison of factorization and SVM tensorial approaches, and (iv) the paper emphasizes the graphical and geometrical possibilities of the factorization methods as a tool to analyze the complex relationships between users and items.

2. Related Work
Learning preferences has been studied with different approaches. A recent Special Issue of the Machine Learning Journal (Hüllermeier and Fürnkranz, 2013) includes some interesting approaches and application fields. From a conceptual point of view, the aim is to learn an ordering relation from some pairwise comparisons. Thus, the learning task can be cast as a binary classification task and solved using SVMs; see for instance (Herbrich et al., 1999; Joachims, 2002).

In this paper we adopt a more general strategy: we explicitly optimize a loss function with regularization. For instance, it is possible to use the logistic loss as in the learner proposed by Rendle et al. (2009). The algorithm presented in that paper was derived from a Bayesian analysis of the ranking problem of a user for a set of items. The algorithm is called Bayesian Personalized Ranking (BPR) and was devised as a method to solve the maximum posterior estimation problem.

We present a general setting that includes, at the same time, a factorization framework and a tensorial approach: an SVM that uses tensor products to model the interactions of consumer and item representations. Tensorial representations were already used, for instance, in (Basilico and Hofmann, 2004; Rendle and Schmidt-Thieme, 2010; Pahikkala et al., 2012). We report a comparison between factorization methods and this SVM approach in the experimental section using a real dataset.

Factorization algorithms were previously used in recommender systems, including some of the best-ranked systems in the Netflix Prize; see for instance (Koren et al., 2009). Many other papers propose matrix factorization for solving specific problems in recommender systems; see for instance (Ocepek et al., 2015). A software library implementing factorization machines with a wide variety of options is presented in (Rendle, 2012). Other similar implementations can be found in (Chen et al., 2011; Agarwal and Chen, 2009; Bayer, 2015). Let us remark that the goal of this paper is not to present another implementation. We use a quite straightforward SGD (Stochastic Gradient Descent) implementation whose main advantage is that it is the same for the three different approaches. We want to underscore that the differences arise only from the formulation of the approach, and not from any implementation issue. Additionally, we are interested in discussing the necessity of using the features of the items and consumers involved in the preference judgments. This is a central point in learning preferences, and the approach presented in this paper is quite suitable for this purpose: we may use all the information available in a real-world application.

Another interesting use of factorizations is presented in (Weston et al., 2010, 2011). The target is information retrieval, and so the aim was to optimize the ranking of labels attached to queries (images or music). In this case the output is an ordered set of labels.

As was said in the introduction, we are concerned with the graphical properties of the model learned from preferences. Both consumers and items are located in a Euclidean space with a specific aim in mind. This is the case, for instance, in (Moore et al., 2012; Chen et al., 2012), where the purpose was to build an embedding of songs. The proximity was learned from a collection of playlists. Once the map is built, playlists are generated using the relative distances of songs. To model proximity, the authors use Gaussian distributions, and the same tool was also used to add tags to the songs.
In this paper we do not assume any distribution of the representation of data in the Euclidean space.

Another paper related to the work reported here is (Xing et al., 2002). Here, the authors learn a metric for Euclidean points that represents similarities and dissimilarities. The metric is given by a positive semi-definite matrix that is the solution of a convex optimization problem. In our case the factorization eases the learning process, since the number of parameters to be estimated may be significantly smaller. Moreover, the preference learning tasks only have dissimilarity examples; there are no similarity cases to guide the induction. On the other hand, in (Xing et al., 2002) there is neither dimensionality reduction nor any visualization purpose: the objective is to find clusters. A quite similar approach can be found in (Parameswaran and Weinberger, 2010), in this case to learn a metric for Multi-Task Learning.

To learn metrics is also the aim of Peltonen et al. (2003). The purpose there is to reduce the dimensionality for visualization. The source data are collections of labelled data for classification tasks, and the proposals are extensions of the so-called Self-Organizing Maps (SOM). However, notice that our purpose is not only to learn a metric, but to learn preferences while representing both consumers and products in a metric space.

Finally, the visualization method presented here is in fact a supervised learning algorithm, like supervised PCA (Koren and Carmel, 2004; Yu et al., 2006; Du et al., 2015) for instance. The difference is that our approach explicitly incorporates the loss function and the definition of similarity that we want to obtain at the end of the process. And, of course, the method presented here is devised for learning preferences.

3. Formal Framework

Let us consider the following dataset
\[
D = \{(x_1, f(x_1)), \ldots, (x_n, f(x_n))\}. \tag{1}
\]
Here we assume that f is an unknown real function on the space from which inputs x ∈ R^m are drawn. The aim is to find a new function g of the input data x, depending also on some parameters θ, such that the variations of f can be predicted by the variations of g. The function g will have an analytical definition that makes it straightforward to compute g on any input. In symbols, the aim of g is to maximize the probability
\[
\Pr\big( f(x) > f(x') \iff g(x, \theta) > g(x', \theta) \big). \tag{2}
\]
In the following, as usual in this context, we will call g a utility function.

To learn g we define the following ordering induced by D:
\[
D_{or} = \big\{ \big( x_i, x_j;\ [\![ f(x_i) > f(x_j) ]\!] \big) : i, j = 1, \ldots, n \big\}. \tag{3}
\]
The symbol [[p]] stands for the value 1 when the predicate p is true, and −1 otherwise. In the next subsections we present an approach to learn g from this binary classification task.

Formally, the learning process of the parameters θ of g starts with the dataset D_or (Eq. 3). Soon we shall see that we may use only the examples of the positive class,
\[
D^{+}_{or} = \{ (x_i, x_j) : f(x_i) > f(x_j),\ i, j = 1, \ldots, n \}. \tag{4}
\]
Notice that, in fact, we do not need the function f in our approach. In practice, f is hidden and we do not have access to it; otherwise, the dataset (Eq. 1) could be seen as a regression task. Roughly speaking, the dataset D^+_or is the set of pairs where an explicit ordering has been registered. Each pair is formed by the better and the worse object. These pairs are usually called preference judgments; see (Joachims, 2002).
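As a small illustration (the item names and scores below are hypothetical, not taken from the paper), the following Python sketch builds the preference judgments of Eq. (4) from the ratings given by one consumer in a single session; ties carry no ordering information and are discarded.

```python
import itertools

def preference_judgments(ratings):
    """Build the positive preference set D+_or of Eq. (4).

    `ratings` maps an item identifier to the score f(x) that one consumer
    gave it in a session.  A pair (better, worse) is emitted whenever the
    two scores are strictly different; ties are simply skipped.
    """
    pairs = []
    for (xi, fi), (xj, fj) in itertools.combinations(ratings.items(), 2):
        if fi > fj:
            pairs.append((xi, xj))
        elif fj > fi:
            pairs.append((xj, xi))
    return pairs

# Example: one consumer rated four pieces of meat in a session
session = {"item_a": 7, "item_b": 4, "item_c": 7, "item_d": 9}
print(preference_judgments(session))
# e.g. the pair ('item_d', 'item_b') appears because 9 > 4
```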
We adopted a margin maximization approach, detailed in the next section, to learn from such a dataset; in this way we include the hypotheses learned by SVMs as in (Herbrich et al., 1999; Joachims, 2002; Bahamonde et al., 2007; Basilico and Hofmann, 2004; del Coz et al., 2005; Díez et al., 2005, 2006). This task could also be solved using a probabilistic approach.

4. Maximum Margin Approach

As usual, we assume that all these examples are independently and identically drawn (i.i.d.) from an unknown distribution. Thus, using a maximum margin approach, the parameters θ should minimize
\[
Loss(\theta, D^{+}_{or}) = \sum_{(x_i, x_j) \in D^{+}_{or}} \max\big( 0,\ 1 - g(x_i, \theta) + g(x_j, \theta) \big). \tag{5}
\]
Following Rendle et al. (2009), margin maximization can be done using an SGD algorithm (Robbins and Monro, 1951) with a regularization term r(θ) for the parameter θ. Thus, the optimal value θ* is given by
\[
\theta^{*} = \operatorname*{argmin}_{\theta}\ Loss(\theta, D^{+}_{or}) + \nu\, r(\theta). \tag{6}
\]
The idea is to ensure that the difference of the utilities in a preference judgment is at least 1:
\[
g(x_i, \theta) - g(x_j, \theta) \geq 1, \qquad (x_i, x_j) \in D^{+}_{or}.
\]
Of course, this is equivalent to
\[
g(x_j, \theta) - g(x_i, \theta) \leq -1, \qquad (x_j, x_i) \in D^{-}_{or},
\]
where D^-_or is the subset of D_or (Eq. 3) with negative class. The consequence is that we may get rid of the negative part, since it is redundant.

The corresponding optimization with this loss function can be solved with Algorithm 1, which implements this approach using an L2 regularization term. The updating step due to (x_i, x_j) is done by
\[
\theta \leftarrow \theta - \gamma \left[ \frac{\partial (Loss(\theta))_{ij}}{\partial \theta} + \nu \frac{\partial r(\theta)}{\partial \theta} \right]. \tag{7}
\]
That is,
\[
\theta \leftarrow \theta + \gamma \left[ \frac{\partial g(x_i, \theta)}{\partial \theta} - \frac{\partial g(x_j, \theta)}{\partial \theta} - \nu \frac{\partial r(\theta)}{\partial \theta} \right] \tag{8}
\]
if 1 − g(x_i, θ) + g(x_j, θ) > 0. Additionally, to ensure numerical stability, following Weston et al. (2010, 2011), we use a parameter R (a radius) such that the size of θ is always smaller than or equal to R.

Algorithm 1: SGD algorithm to learn a utility function that maximizes the margin as defined in (Eq. 5) using an L2 regularization
  Input: D^+_or (Eq. 4)
  Input: γ > 0 (learning rate); ν > 0 (regularization parameter)
  Input: R > 0 (radius)
  assign random values to the components of θ
  repeat
      fetch a random pair (x_better, x_worse) ∈ D^+_or
      if 1 > g(x_better, θ) − g(x_worse, θ) then
          θ ← θ + γ [ ∂g(x_better, θ)/∂θ − ∂g(x_worse, θ)/∂θ − ν ∂r(θ)/∂θ ]
          if ||θ|| > R then
              θ ← (R / ||θ||) θ
          end if
      end if
  until stop criterion
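As an illustration, here is a minimal Python sketch of Algorithm 1 for a generic utility g (the function and variable names are ours, not taken from the paper); the concrete utilities introduced in the next section only have to supply g and its gradient with respect to the parameters.

```python
import numpy as np

def sgd_rank(pairs, g, grad_g, dim, gamma=0.1, nu=1e-3, R=1.0,
             n_iters=5000, seed=0):
    """Minimal sketch of Algorithm 1 for a generic utility g(x, theta).

    `pairs` is D+_or, a list of (x_better, x_worse) preference judgments;
    `g(x, theta)` returns the utility and `grad_g(x, theta)` its gradient
    with respect to the parameter vector theta.  Following Eq. (8), an
    update is made only when the margin condition 1 > g(better) - g(worse)
    is violated; an L2 penalty is subtracted (the constant factor of the
    derivative of ||theta||^2 is folded into nu) and theta is projected
    back onto the ball of radius R.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.1, size=dim)
    for _ in range(n_iters):
        better, worse = pairs[rng.integers(len(pairs))]
        if 1.0 > g(better, theta) - g(worse, theta):
            theta += gamma * (grad_g(better, theta) - grad_g(worse, theta)
                              - nu * theta)
            norm = np.linalg.norm(theta)
            if norm > R:                     # radius projection (Section 4)
                theta *= R / norm
    return theta

# Toy usage with a linear utility g(x, theta) = <theta, x>
pairs = [(np.array([1.0, 0.0]), np.array([0.0, 1.0]))]
theta = sgd_rank(pairs, g=lambda x, t: float(t @ x),
                 grad_g=lambda x, t: x, dim=2, n_iters=200)
print(theta)   # the first component should end up larger than the second
```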
5. Factorization and Tensorial Approaches

In the last section, inputs were described by a generic vector x and the aim was to emphasize the ordering of these vectors according to the values of f. Now we are going to get into the structure of the inputs as the concatenation of two different vectors, the representation of consumers and of items or products (we prefer the term products so that we can use p instead of i in the equations). Thus, in the following, we are going to assume that each input can be split into two parts: x = (c, p).

In this section, we introduce three possible definitions of the utility function g. They have in common that they rest on the interaction between the vectorial representations of consumers and products.

5.1. Mapping Consumers and Products: Matrix Factorization

In this subsection, we are going to consider an embedding of both consumers and products in a common Euclidean space. Then, the function g (Eq. 2) will be defined in terms of the mappings in the common space. We assume that consumers are described by vectors in a Euclidean space of dimension |Con|, while products are given by vectors with |Prod| components. We are going to represent them in a common space of dimension k using two linear maps given respectively by matrices W and V:
\[
\mathbb{R}^{|Con|} \longrightarrow \mathbb{R}^{k}, \qquad c \mapsto W c, \tag{9}
\]
\[
\mathbb{R}^{|Prod|} \longrightarrow \mathbb{R}^{k}, \qquad p \mapsto V p. \tag{10}
\]
Let us remark that, as usual, we are considering vectors as column matrices. In this context, the parameter θ to be learned is the pair of matrices W, V. We are trying to solve the optimization problem
\[
W^{*}, V^{*} = \operatorname*{argmin}_{W, V} \big( Loss(W, V, D^{+}_{or}) + \nu\, r(W) + \nu\, r(V) \big). \tag{11}
\]
Notice that there are different options to define the interaction of consumers and products. Next, we present two of them.

5.1.1. Inner Products

The first alternative is to formalize the interactions by the following inner product of the mappings of consumers and products in R^k:
\[
\begin{aligned}
g_{in}(x) = g_{in}(c, p) &= \langle W c, V p \rangle \\
&= (W c)^{T} V p = c^{T} W^{T} V p = \sum_{r=1}^{|Con|} \sum_{s=1}^{|Prod|} \big( W^{T} V \big)_{rs} \big( c\, p^{T} \big)_{rs} \\
&= (V p)^{T} W c = \sum_{s=1}^{|Prod|} \sum_{r=1}^{|Con|} \big( V^{T} W \big)_{sr} \big( p\, c^{T} \big)_{sr}.
\end{aligned} \tag{12}
\]
It is interesting to realize that the utility function g_in is given by a linear combination of all the products formed by one component from the consumer description and one component from the description of the product.

The type of equation is different if we add one constant component (with value 1, for instance) to the vectorial representation of consumers and products; that is,
\[
c^{T} \leftarrow [c^{T}\ 1]; \qquad p^{T} \leftarrow [p^{T}\ 1]. \tag{13}
\]
In this case, the utility function can be thought of as follows,
\[
g_{in}(c, p) = \sum_{r,s} \alpha_{rs}\, c_r\, p_s + \sum_{r} \beta_r\, c_r + \sum_{s} \delta_s\, p_s + \tau, \tag{14}
\]
for some real coefficients α_rs, β_r, δ_s and τ. That is, the utility is a polynomial of degree 2 where the monomials of degree 2 are always built as the product of one component of the representation of consumers and one from that of products.

If we compare the equations (Eq. 12, 14), we appreciate that the coefficients of the polynomial that defines g_in are factorized into two matrices, as was mentioned in the introduction.

The partial derivatives needed to implement this approach in Algorithm 1 are the following:
\[
\frac{\partial g_{in}(c,p)}{\partial W} = \frac{\partial (V p)^{T}(W c)}{\partial W} = V p\, c^{T},
\qquad
\frac{\partial g_{in}(c,p)}{\partial V} = \frac{\partial (W c)^{T}(V p)}{\partial V} = W c\, p^{T}.
\]
On the other hand, we use the square of the Frobenius norm as the regularization summand:
\[
r(W) = \| W \|_{F}^{2} = \mathrm{Tr}(W^{T} W), \qquad r(V) = \| V \|_{F}^{2} = \mathrm{Tr}(V^{T} V).
\]
Therefore, the regularization derivatives are
\[
\frac{\partial\, \mathrm{Tr}(W^{T} W)}{\partial W} = 2W, \qquad \frac{\partial\, \mathrm{Tr}(V^{T} V)}{\partial V} = 2V. \tag{15}
\]
The Frobenius norm of the matrices is also used to measure the size of the parameters in Algorithm 1.

5.1.2. Euclidean Closeness

The second option that we explore for defining g is the interaction given by closeness. In symbols, we define
\[
\begin{aligned}
g_{cl}(c, p) &= - \| W c - V p \|^{2} \\
&= - \| W c \|^{2} - \| V p \|^{2} + 2 \langle W c, V p \rangle \\
&= - \| W c \|^{2} - \| V p \|^{2} + 2\, g_{in}(c, p).
\end{aligned} \tag{16}
\]
Notice that, compared with the utility function defined in (Eq. 12), we now add more summands to the equation. The new utility, g_cl, includes the weighted sum of all monomials of degree 2 formed with variables taken from the description of consumers (c) or products (p). Of course, to guarantee this, we need to add one constant component (with value 1, for instance) to the vectorial representation of consumers and products; see (Eq. 13).

The derivatives needed to implement the learning algorithm are the following:
\[
\frac{\partial g_{cl}(c,p)}{\partial W} = - \frac{\partial (W c)^{T}(W c)}{\partial W} + 2 \frac{\partial g_{in}(c,p)}{\partial W} = -2\, W c\, c^{T} + 2\, V p\, c^{T},
\]
\[
\frac{\partial g_{cl}(c,p)}{\partial V} = - \frac{\partial (V p)^{T}(V p)}{\partial V} + 2 \frac{\partial g_{in}(c,p)}{\partial V} = -2\, V p\, p^{T} + 2\, W c\, p^{T}.
\]
We use the same regularization as in the case of the utility defined in terms of the inner product.

The advantage of this definition of g is that its visual semantics is easier to appreciate: the Euclidean representations of consumers and products are closer or farther apart according to the preferences. The inner product is a simpler equation, but it is harder to visualize.
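To make the two factorization utilities concrete, the following Python sketch (our own illustration, not the authors' code) implements g_in and g_cl together with the gradients derived above. These functions supply the quantities needed by Algorithm 1; in an SGD loop like the one sketched earlier, θ would consist of the two matrices W and V, updated jointly with these gradients.

```python
import numpy as np

def g_inner(W, V, c, p):
    """Inner-product utility of Eq. (12): g_in(c, p) = <Wc, Vp>."""
    return float((W @ c) @ (V @ p))

def grad_inner(W, V, c, p):
    """Gradients of g_in with respect to W and V: (Vp)c^T and (Wc)p^T."""
    return np.outer(V @ p, c), np.outer(W @ c, p)

def g_closeness(W, V, c, p):
    """Closeness utility of Eq. (16): g_cl(c, p) = -||Wc - Vp||^2."""
    d = W @ c - V @ p
    return float(-(d @ d))

def grad_closeness(W, V, c, p):
    """Gradients of g_cl: -2Wcc^T + 2Vpc^T and -2Vpp^T + 2Wcp^T."""
    Wc, Vp = W @ c, V @ p
    dW = -2.0 * np.outer(Wc, c) + 2.0 * np.outer(Vp, c)
    dV = -2.0 * np.outer(Vp, p) + 2.0 * np.outer(Wc, p)
    return dW, dV

# Toy check with random matrices: k = 2, |Con| = 3, |Prod| = 4
rng = np.random.default_rng(0)
W, V = rng.normal(size=(2, 3)), rng.normal(size=(2, 4))
c, p = rng.normal(size=3), rng.normal(size=4)
print(g_inner(W, V, c, p), g_closeness(W, V, c, p))
```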
5.2. Tensor Product

The full description of the utility functions presented in the last subsection (Eq. 14) can be seen as a particular case of a linear function on the tensor product of consumers and products. In symbols,
\[
g_{\otimes}(c, p) = \langle w, c \otimes p \rangle, \tag{17}
\]
where w is a vector in the Euclidean space of dimension |Con| × |Prod|. If we use a pair of indexes to refer to the components of w, the previous equation can be written as
\[
g_{\otimes}(c, p) = \sum_{r,s} w_{rs}\, c_r\, p_s. \tag{18}
\]
Once more, this expression includes all the terms of (Eq. 14) provided that a constant component is included in the vectors c and p.

It is important to emphasize that the number of parameters to be learned in this approach is considerably larger than in the factorization cases, provided that the value of k (the dimension of the common Euclidean space) is small. Again, we may learn these parameters, the components of w, using Algorithm 1. For this purpose, we only need to compute the derivative
\[
\frac{\partial g_{\otimes}(c, p)}{\partial w} = c \otimes p, \tag{19}
\]
and the derivative of the regularization summand, which is given by
\[
\frac{\partial r(w)}{\partial w} = 2 w. \tag{20}
\]
Algorithm 1 learns the parameter w of (Eq. 17) using SGD; the size of the parameter is measured by the Euclidean norm of w. This approach is then equivalent to a Support Vector Machine (SVM) used to learn to rank. In the experimental section we will denote this learning algorithm as SVM⊗.

6. Experimental Results

In this section we report a set of experiments carried out to show the performance of the proposals of this paper. First we present some implementation details of Algorithm 1. Then we introduce the datasets used in the experiments and report the accuracy obtained with each utility function (let us recall that in all cases the learning algorithm is the same). Finally, we show some graphical representations obtained as a side effect of the learning process to illustrate the visualization possibilities of the factorization approaches when used to learn consumer preferences.

6.1. Implementation Details

The implementation of Algorithm 1 was done using Pegasos as a model; see (Shalev-Shwartz et al., 2011). Thus, the learning rate follows the equation
\[
\gamma = \frac{\gamma_0}{1 + \gamma_s (n - 1)}.
\]
To avoid having many parameters to adjust, we fixed γ_0 = 1 and γ_s = 0.01. The radius (Section 4) was also fixed: R = 1. As usual, n is the ordinal of the iteration. To update the model learned by the algorithm we used a mini-batch strategy, averaging the updates every time 10% of the training examples were processed.

The only adjustable parameter in the algorithm was the regularization parameter. We performed an internal grid search to determine the best option in the set ν ∈ {10^i : i = −1, ..., −10}, using a 2-fold cross validation repeated 3 times on the training set. Finally, the algorithm stops when the size of the difference of the parameter θ in two consecutive iterations is smaller than 10^−6 or the number of iterations reaches 5000.

Task            |D+or|
Acceptability    3084
Flavor           3080
Tenderness       3313

Table 1: Sizes of the datasets used in the experiments. |D+or| stands for the number of preference judgments (Eq. 4). The number of consumers is 392 and there are 307 items.
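The following Python sketch (our illustration; the names `fit` and `g` are hypothetical placeholders for Algorithm 1 and a learned utility) shows the score reported in the tables below, namely the fraction of misclassified preference judgments, and the internal grid search over ν just described.

```python
import numpy as np

def pairwise_error(pairs, g, theta):
    """Fraction of preference judgments (x_better, x_worse) that the
    learned utility g(., theta) ranks in the wrong order (ties count
    as errors), i.e. the score reported in Tables 2-4."""
    wrong = sum(1 for b, w in pairs if g(b, theta) <= g(w, theta))
    return wrong / len(pairs)

def select_nu(train_pairs, fit, g, repeats=3, seed=0):
    """Internal model selection of Section 6.1: try nu in {10^-1,...,10^-10},
    scored by 2-fold cross validation repeated 3 times on the training
    pairs.  `fit(pairs, nu)` is assumed to run Algorithm 1 and return theta."""
    rng = np.random.default_rng(seed)
    grid = [10.0 ** (-i) for i in range(1, 11)]
    errors = {nu: [] for nu in grid}
    for _ in range(repeats):
        idx = rng.permutation(len(train_pairs))
        fold_a, fold_b = idx[: len(idx) // 2], idx[len(idx) // 2:]
        for tr, te in ((fold_a, fold_b), (fold_b, fold_a)):   # 2-fold CV
            tr_p = [train_pairs[i] for i in tr]
            te_p = [train_pairs[i] for i in te]
            for nu in grid:
                errors[nu].append(pairwise_error(te_p, g, fit(tr_p, nu)))
    return min(grid, key=lambda nu: np.mean(errors[nu]))

# Toy usage with a dummy learner that ignores nu (for illustration only)
pairs = [(np.array([1.0, 0.0]), np.array([0.0, 1.0]))] * 8
best = select_nu(pairs, fit=lambda p, nu: np.array([1.0, -1.0]),
                 g=lambda x, t: float(t @ x))
print(best)
```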
6.2. Datasets

The dataset used in this paper comes from a study carried out to determine the features that entail consumer acceptance of beef meat from seven Spanish breeds (Gil et al., 2001; Sañudo et al., 2004; del Coz et al., 2005; Díez et al., 2005, 2006; Bahamonde et al., 2007). Each piece of meat was described by: the weight of the animal, ageing time, breed, 6 physical features describing its texture, and 12 sensory characteristics rated by 11 different experts (132 ratings). The dataset has 307 different items.

In each testing session, 4 or 5 pieces of meat were tested and a group of consumers was asked to rate (on a scale of 1 to 10 points) three different aspects: tenderness, flavor and overall acceptance. The number of consumers involved in this panel was 392. The features of the consumers are just sex, age and job.

The preferences expressed by consumers were represented in a dataset of preference judgments like D^+_or, where each input x is the concatenation of the feature description of the item and that of the consumer. We only considered pairs where the preferences of the consumer were strictly different for the two items. Thus, in each dataset the number of preference judgments is slightly different. Table 1 reports the number of preference judgments for each learning task.

The data was preprocessed. The discrete features were binarized over the whole dataset. On the other hand, the continuous features were standardized in each training set; the mean and standard deviation of the training data were used to standardize the test set.

6.3. Results and Discussion

To estimate the accuracy of the utility functions learned, we used cross validation on the D^+_or versions of acceptability, flavor and tenderness. In addition to the feature descriptions of consumers and items, we added a binary identification of them. That is to say, each object (consumer or item) includes in its description a vector whose dimension is the number of objects; all components of that vector are 0 except the one whose index is the ordinal of the object, which has value 1. To check the role played by these identifiers, we considered two different versions of each dataset: with and without identifiers. In preference learning, sometimes we do not have any feature description of items or consumers; then we can only use such identifiers.

For ease of reading, let us give a simple example. If we have only 3 consumers, their representations can be the following. With id codes the consumers are represented by

consumer1 = (1, 0, 0, sex1, age1, job1),
consumer2 = (0, 1, 0, sex2, age2, job2),
consumer3 = (0, 0, 1, sex3, age3, job3).

Without identification codes, we drop the first three binary components and each consumer is represented only by their sex, age and job. Of course, analogous representations can be used for products.

                                        gcl                    gin
Dataset                  SVM⊗      2     10    100        2     10    100
Acceptability  no ID     28.2    28.6   26.4   25.4     31.7   26.8   26.5
               with ID   21.9    26.9   18.7   16.2     27.5   18.8   16.1
Flavor         no ID     30.7    35.2   29.9   29.7     36.2   28.5   28.6
               with ID   24.6    32.9   21.4   19.9     31.5   20.4   18.1
Tenderness     no ID     25.5    26.4   24.3   25.0     28.5   23.9   23.9
               with ID   21.1    24.3   17.5   15.6     26.4   18.2   16.2

Table 2: Percentages of misclassified preference judgments estimated with 10-fold cross validation using an internal grid search for the parameters of the learners. Columns labeled 2, 10 and 100 report the scores of the factorizations obtained with that value of k. Testing was carried out on each fold while training was performed on the remaining 9.

The first block of experiments used a 10-fold cross validation. The systems were trained using 9 folds and the test was performed on the remaining fold. The scores are reported in Table 2.
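A minimal sketch (with hypothetical, already preprocessed feature values) of the two representations compared in the experiments, with and without the identification code, following the example above:

```python
import numpy as np

def encode(index, n_objects, features, with_id=True):
    """Build the vector representation used in the experiments: an optional
    one-hot identification code followed by the (binarized / standardized)
    features.  With three consumers this reproduces the example above,
    e.g. consumer1 = (1, 0, 0, sex1, age1, job1)."""
    one_hot = np.eye(n_objects)[index] if with_id else np.empty(0)
    return np.concatenate([one_hot, np.asarray(features, dtype=float)])

# consumer #0 out of 3, with hypothetical sex/age/job features
print(encode(0, 3, [1.0, 0.37, 0.0]))                  # with id code
print(encode(0, 3, [1.0, 0.37, 0.0], with_id=False))   # features only
```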
We observe that the performance of the SVM that uses the tensor product (SVM⊗) is worse than the performance of the factorization methods (g_in and g_cl), which are really quite similar. Additionally, the influence of the dimension k of the Euclidean space (Eq. 9) is dramatic in the factorization systems: greater values of k provide better results. In all cases, the scores of the tensorial version are somewhere in the middle between the factorization scores with k = 2 and k = 10.

In all cases the use of identifiers improves the scores considerably. In some cases the difference is 10 points better with identifiers than without them. The reason is that some items or consumers in the test sets were also known at the training stage. But this is the case in many sensory data studies. Sometimes the number of options that a food company is considering for a new product is the whole set of items, both in training and in test. On the other hand, if we want to model the assessments of a selected panel of consumers, they must be present both in training and in test examples.

In the experiments reported in Table 2, 90% of the preference judgments are in the respective training set; therefore, most of the consumers and items in each respective test set appear in the training set too. To check the effect of the appearance of already known objects, and also to check the effect of the number of training examples, we performed two additional experiments. In this case, however, we used only the factorization systems, since the performance of the tensorial systems was very poor. First, we report the experiments carried out with 10-fold cross validation using each fold as the training set and the remaining 9 as the test set. The results are shown in Table 3.

                                gcl                    gin
Dataset                   2     10    100        2     10    100
Acceptability  no ID    39.0   38.3   38.7     37.9   38.0   38.0
               with ID  38.7   37.4   37.4     37.7   37.3   36.6
Flavor         no ID    43.5   43.4   43.0     43.7   43.1   42.5
               with ID  43.1   42.6   42.9     43.2   41.3   40.3
Tenderness     no ID    36.5   35.0   35.2     35.3   34.8   35.2
               with ID  35.9   34.7   34.1     35.5   34.6   34.5

Table 3: Percentages of misclassified preference judgments estimated with 10-fold cross validation using an internal grid search for the parameters of the learners. Columns labeled 2, 10 and 100 report the scores of the factorizations obtained with that value of k. Training was carried out on each fold while testing was performed on the remaining 9.

The results are substantially worse, as the number of training examples is very small. Nevertheless, the impact of the identifiers of consumers and items is beneficial in all cases but one, although the increase in accuracy is smaller than in the experiments reported in Table 2. On the other hand, again we see that higher values of k give rise to better performance.

Finally, in Table 4 we report an intermediate setting. Now we use only two folds; therefore, half of the items and consumers in the test set already appeared in the training set. As expected, the results are better than those of Table 3, but worse than the scores shown in Table 2. In this case, for k = 100, the error is mostly below 25% with identifiers, and around 30% without identifiers. The role of k is again of paramount importance.
                                gcl                    gin
Dataset                   2     10    100        2     10    100
Acceptability  no ID    33.0   30.9   31.5     31.2   31.5   31.6
               with ID  30.6   26.1   24.7     31.3   28.4   23.6
Flavor         no ID    37.1   33.3   33.4     37.6   33.9   32.1
               with ID  34.9   29.1   28.7     36.3   28.2   24.5
Tenderness     no ID    29.4   27.8   28.3     28.9   29.1   28.9
               with ID  27.1   25.7   24.1     28.3   24.4   23.7

Table 4: Percentages of misclassified preference judgments estimated with 2-fold cross validation using an internal grid search for the parameters of the learners. Columns labeled 2, 10 and 100 report the scores of the factorizations obtained with that value of k.

6.4. Visualization of Preferences

The graphical possibilities of factorization methods, in addition to good prediction scores, also provide some interesting applications. In particular, visualization is very natural when the Euclidean space has up to 3 dimensions. Another application is clustering, in order to find groups of consumers with similar tastes or collections of items with similar appreciation by consumers.

In this subsection we illustrate these applications in sensory data analysis. To create the subsequent visualizations we used all available data with identities, applying Algorithm 1 with g_cl and k = 2. The idea is to obtain pictures where the proximity of an item and a consumer is the utility that represents the preference. The resubstitution error is 15.27% in acceptability, 18.47% in flavor, and 13.91% in tenderness.

In the graphs, the small dots represent consumers located in R^2 according to their ratings of acceptability (Figure 1), flavor (Figure 2) and tenderness (Figure 3), respectively. According to the literature about sensory preferences of beef meat (Gil et al., 2001; Sañudo et al., 2004; del Coz et al., 2005; Díez et al., 2005, 2006; Bahamonde et al., 2007), the most important features that explain the preferences of consumers are ageing and intramuscular fat (intrafat for short). These are discrete features. Ageing has 3 different values: 1, 7 and 21 days. And intrafat was discretized to obtain 3 options: low, medium and high. Thus, in the same graph as the consumers, we represented the average item for each value of these important features. This is a kind of tag for the feature values, in the sense used in (Chen et al., 2012; Moore et al., 2012).

The left part of Figure 1 is a Voronoi diagram of the space whose seeds are the centroids of the Euclidean representations of the possible values of ageing. The lowest ageing values, 1 and 7, have very close centroids, and what is really interesting is the split between low ageing (1 or 7 days) and high ageing (21 days). In the right-hand side of the figure, the centroids of items with medium or high intrafat are near each other and provide a clear split from the consumers that prefer low intrafat values. Notice that the splits due to ageing and intrafat are almost the same. That is to say, consumers that like meat with 21 days of ageing also prefer meat with low intrafat values.

Figure 1: Consumers represented according to their ratings in acceptability. Voronoi diagram whose seeds are the centroids of items with different values of ageing (left) and intrafat (right).
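A sketch of how pictures of this kind can be produced from the learned 2-D embeddings is given below (our own illustration using matplotlib and random embeddings; the function and variable names are hypothetical). Instead of drawing the Voronoi cells explicitly, it assigns each consumer to the nearest feature-value centroid, which yields exactly the same partition of the space.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_segments(consumer_xy, item_xy, item_feature, title):
    """Mimic the style of Figures 1-3: consumers are dots in the learned
    R^2 space and each one is assigned to the nearest centroid of the items
    sharing a value of a discrete feature (ageing or intrafat).  The
    nearest-centroid regions are the cells of the Voronoi diagram whose
    seeds are those centroids."""
    item_feature = np.asarray(item_feature)
    values = sorted(set(item_feature.tolist()))
    centroids = np.array([item_xy[item_feature == v].mean(axis=0)
                          for v in values])
    # distance of every consumer to every centroid -> index of the nearest
    d = np.linalg.norm(consumer_xy[:, None, :] - centroids[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    fig, ax = plt.subplots()
    for i, v in enumerate(values):
        xy = consumer_xy[nearest == i]
        ax.plot(xy[:, 0], xy[:, 1], ".", ms=4, label=f"prefer {v}")
        ax.annotate(str(v), centroids[i], fontsize=12, weight="bold")
    ax.set_title(title)
    ax.legend()
    return fig

# Toy example with random embeddings (k = 2) and a 3-valued ageing feature
rng = np.random.default_rng(0)
plot_segments(rng.normal(size=(200, 2)),        # consumers, i.e. rows W c
              rng.normal(size=(40, 2)),         # items, i.e. rows V p
              rng.choice([1, 7, 21], size=40),  # ageing value of each item
              "acceptability vs. ageing")
plt.show()
```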
In Figure 2 the position of the consumers is different from the previous picture, since now the feature rated by the consumers is flavor. In this case the relevance of ageing (left-hand side of the figure) is clear. Consumers mostly prefer the flavor of meat after 21 days of ageing. Notice that the relative position of the centroids increases from left to right. According to intrafat, flavor divides consumers into those that prefer low or medium values (their centroids are quite near each other) and those that prefer the flavor of meat with high intrafat. There are two market segments according to intrafat when flavor is the target feature.

Figure 2: Consumers represented according to their ratings in flavor. Voronoi diagram whose seeds are the centroids of items with different values of ageing (left) and intrafat (right).

Finally, Figure 3 depicts consumers located in R^2 according to their ratings of tenderness. In this case, the centroids are clearly separated. When considering the centroids of the ageing values we appreciate that 21 is the value most associated with tenderness; this is a well-known fact, since ageing is closely related to physical measures of softness in meat.

Figure 3: Consumers represented according to their ratings in tenderness. Voronoi diagram whose seeds are the centroids of items with different values of ageing (left) and intrafat (right).

7. Conclusions

We have presented factorization approaches to learning and visualizing the preferences of consumers about a certain type of product. The models learned are more accurate than existing tensorial approaches that typically use an SVM. The framework presented in this paper includes both factorization and tensorial methods at the same time; both cases use the same learning algorithm with a different equation as the goal to optimize. Then, the accuracy of the models can be explained in terms of the number of parameters to learn.
This is the case of recommender systems or in general matrix completions, well-known applications of embeddings or matrix factorizations. What we emphasize here is the graphical properties of the Euclidean repre- sentations. Then, it is possible to learn similarities of objects with respect to their behavior with a class. The applications include direct marketing and fraud detection. To check the validity of the proposal we used a set of experiments carried out with real data of sensory analysis of beef meat according to consumer preferences. Factorization methods outperform tensorial SVM. On the other hand, the Euclidean representations obtained in these datasets emphasize the relevance of some well-known traits involved in consumer preferences. The software used in the experiments can be downloaded from this1 web- site. 1We will provide a link to download the implementation in the final version of the paper 28 Acknowledgments The research reported here is supported in part under grant TIN2011- 23558 from the Ministerio de Economı́a y Competitividad, Spain. The paper was written while A. Bahamonde was visiting Cornell University with grants Movilidad Campus de Excelencia Internacional (Universidad de Oviedo) and from Programa Nacional de Movilidad de Recursos Humanos (Ministerio de Educación, Cultura y Deporte, Spain). References Agarwal, D., Chen, B.-C., 2009. Regression-based latent factor models. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 19–28. Bahamonde, A., Dı́ez, J., Quevedo, J., Luaces, O., del Coz, J., 2007. How to learn consumer preferences from the analysis of sensory data by means of support vector machines (SVM). Trends in Food Science & Technology 18 (1), 20–28. Basilico, J., Hofmann, T., 2004. A joint framework for collaborative and content filtering. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pp. 550–551. Bayer, I., 2015. fastfm: A library for factorization machines. arXiv preprint arXiv:1505.00641. 29 Chen, S., Moore, J., Turnbull, D., Joachims, T., 2012. Playlist prediction via metric embedding. In: Proceedings of the 18th ACM SIGKDD Inter- national Conference on Knowledge Discovery and Data Mining. ACM, pp. 714–722. Chen, T., Zheng, Z., Lu, Q., Zhang, W., Yu, Y., 2011. Feature-based ma- trix factorization. Tech. rep., Apex Data & Knowledge Management Lab, Shanghai Jiao Tong University. arXiv:1109.2271. del Coz, J. J., Bayón, G. F., Dı́ez, J., Luaces, O., Bahamonde, A., Sañudo, C., 2005. Trait selection for assessing beef meat quality using non-linear SVM. In: Advances in Neural Information Processing Systems 17 (NIPS ’04). pp. 321–328. Dı́ez, J., Del Coz, J., Bahamonde, A., Sañudo, C., Olleta, J., Macie, S., Campo, M., Panea, B., Albert́ı, P., 2006. Identifying market segments in beef: Breed, slaughter weight and ageing time implications. Meat science 74 (4), 667–675. Dı́ez, J., del Coz, J., Sañudo, C., Albert́ı, P., Bahamonde, A., 2005. A kernel based method for discovering market segments in beef meat. Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases: ECML/PKDD 2005, 462–469. Du, C., Zhe, S., Zhuang, F., Qi, Y., He, Q., Shi, Z., 2015. Bayesian maximum margin principal component analysis. In: Twenty-Ninth AAAI Conference on Artificial Intelligence. 30 Gil, M., Serra, X., Gispert, M., Angels Oliver, M., Sañudo, C., Panea, B., Olleta, J. 
Gil, M., Serra, X., Gispert, M., Angels Oliver, M., Sañudo, C., Panea, B., Olleta, J. L., Campo, M., Oliván, M., Osoro, K., García-Cachan, M., Izquierdo, M., Espejo, M., Martín, M., Piedrafita, J., 2001. The effect of breed-production systems on the myosin heavy chain 1, the biochemical characteristics and the colour variables of longissimus thoracis from seven Spanish beef cattle breeds. Meat Science 58 (2), 181–188.

Herbrich, R., Graepel, T., Obermayer, K., 1999. Large margin rank boundaries for ordinal regression. Advances in Neural Information Processing Systems, 115–132.

Hüllermeier, E., Fürnkranz, J., 2013. Editorial: Preference Learning and Ranking. Machine Learning, 1–5.

Joachims, T., 2002. Optimizing search engines using clickthrough data. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 133–142.

Koren, Y., Bell, R., Volinsky, C., Aug. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42 (8), 30–37.

Koren, Y., Carmel, L., 2004. Robust linear dimensionality reduction. IEEE Transactions on Visualization and Computer Graphics 10 (4), 459–470.

Moore, J., Chen, S., Joachims, T., Turnbull, D., 2012. Learning to embed songs and tags for playlist prediction. In: Proceedings of ISMIR.

Ocepek, U., Rugelj, J., Bosnić, Z., 2015. Improving matrix factorization recommendations for examples in cold start. Expert Systems with Applications.

Pahikkala, T., Airola, A., Stock, M., De Baets, B., Waegeman, W., 2012. Efficient regularized least-squares algorithms for conditional ranking on relational data. Machine Learning, 1–36.

Parameswaran, S., Weinberger, K. Q., 2010. Large margin multi-task metric learning. In: Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R., Culotta, A. (Eds.), Advances in Neural Information Processing Systems 23, NIPS. pp. 1867–1875.

Peltonen, J., Klami, A., Kaski, S., 2003. Learning metrics for information visualization. In: Proceedings of the Workshop on Self-Organizing Maps (WSOM'03). pp. 213–218.

Rendle, S., 2012. Factorization Machines with libFM. ACM Transactions on Intelligent Systems and Technology (TIST) 3 (3), 57.

Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L., 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, pp. 452–461.

Rendle, S., Schmidt-Thieme, L., 2010. Pairwise Interaction Tensor Factorization for Personalized Tag Recommendation. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, pp. 81–90.

Robbins, H., Monro, S., 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 400–407.

Sañudo, C., Macie, E., Olleta, J., Villarroel, M., Panea, B., Albertí, P., 2004. The effects of slaughter weight, breed type and ageing time on beef meat quality using two different texture devices. Meat Science 66 (4), 925–932.

Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A., 2011. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming 127 (1), 3–30.

Thurstone, L. L., 1927. A law of comparative judgment. Psychological Review 34 (4), 273.

Weston, J., Bengio, S., Hamel, P., 2011. Multi-tasking with joint semantic spaces for large-scale music annotation and retrieval. Journal of New Music Research 40 (4), 337–348.

Weston, J., Bengio, S., Usunier, N., 2010. Large Scale Image Annotation: Learning to Rank with Joint Word-Image Embeddings. Machine Learning 81 (1), 21–35.
Xing, E. P., Jordan, M. I., Russell, S., Ng, A. Y., 2002. Distance metric learning with application to clustering with side-information. In: Advances in Neural Information Processing Systems. pp. 505–512.

Yu, S., Yu, K., Tresp, V., Kriegel, H.-P., Wu, M., 2006. Supervised probabilistic principal component analysis. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 464–473.