title: Herding Effect based Attention for Personalized Time-Sync Video Recommendation
authors: Yang, Wenmian; Gao, Wenyuan; Zhou, Xiaojie; Jia, Weijia; Zhang, Shaohua; Luo, Yutao
date: 2019-05-02
DOI: 10.1109/icme.2019.00085

Abstract: The time-sync comment (TSC) is a new form of user-interactive review attached to real-time video content. TSCs capture users' preferences for videos and are therefore well suited as a data source for video recommendation. However, existing review-based recommendation methods ignore the context-dependent (generated by user interaction), real-time, and time-sensitive properties of TSC data. To bridge these gaps, in this paper we use video images and users' TSCs to design an Image-Text Fusion model with a novel Herding Effect Attention mechanism (called ITF-HEA), which predicts users' favorite videos through model-based collaborative filtering. Specifically, the HEA mechanism weights the context information based on the semantic similarities and time intervals between each TSC and its context, thereby modeling the influence of the herding effect. Experiments show that ITF-HEA outperforms the state-of-the-art baseline by 3.78% in F1-score on average.

Recently, watching online news and entertainment videos has become a mainstream leisure activity, so efficient and accurate personalized video recommendation brings significant convenience to people's lives. Most video recommendation methods focus on user behavior such as browsing history [1, 2] and reviews [3, 4]. In real scenarios, however, most people are unwilling to write high-quality reviews after watching videos, which makes valuable video reviews scarce. Furthermore, some multi-feature-based methods [5, 6] combine image information with review information to capture users' interests from a more comprehensive perspective. However, these methods achieve only limited improvement because images and reviews usually carry unequal amounts of information [7]: an image in a video describes only one moment of the video content, while a review usually describes the overall content of the video. This information gap causes the fusion of reviews and images to lose substantial information.

Meanwhile, a new form of user-interactive review, the time-sync comment (TSC), first introduced by Wu et al. [8] (see Fig. 1), has become increasingly popular in China and Japan, especially among young people. Many popular Chinese video websites such as Youku (http://www.youku.com) and Bilibili (http://bilibili.tv), as well as the Japanese video website NICONICO (http://www.nicovideo.jp), support TSCs. TSCs convey information about the content of the current video frame, the feelings of users, or replies to other TSCs, and can therefore accurately express users' preferences for a video. Moreover, each TSC carries a timestamp recording when it was posted, so, compared with traditional video reviews, it is much easier to obtain the corresponding images for TSCs. The real-time user feedback and the sheer volume of TSCs make them a valuable and accessible source for personalized video recommendation.
In this paper, we focus on mining users' preferences and videos' features from TSCs and their corresponding images to recommend videos to users through model-based Collaborative Filtering (CF). TSCs have several features that distinguish them from traditional video reviews:

(1) Context-dependent. TSCs are usually context-dependent, i.e., later comments often depend on earlier ones. This phenomenon is known as the herding effect in social science [9, 10]. An example of the herding effect is shown in Fig. 1. User A says "I love the male commander!" to express his fondness for the male commander when that character appears in the video. A few seconds later, user B and user C follow up with "I like the male commander too..." and "I am so sad when he died." In this case, users B and C might not have commented if user A had not. That is, the appearance of a TSC is usually not independent but a probabilistic event influenced by preceding comments.

(2) Real-time. Each TSC has a timestamp synchronized with the playback time of the video, and a TSC usually covers only a short period before its timestamp. Therefore, the content of each TSC is closely related to the video content at its timestamp, which makes it easy to sample the corresponding image information by timestamp.

(3) Time-sensitive. According to our observation, TSCs whose timestamps are far apart are unlikely to discuss the same topic, even if their semantics are similar, and users are more likely to follow newer TSCs than older ones. As a result, the herding effect mentioned above does not last long.

These features make TSCs a distinctive kind of review. However, most current TSC-based recommendation methods [3, 11] assume that TSCs are independent of each other and ignore the time information. This assumption discards the features above, losing crucial semantic information and hurting the accuracy of the results. The central challenge is therefore how to take the context-dependent (herding effect), real-time, and time-sensitive nature of TSCs into account so as to extract the textual information and fuse it with visual information accurately and effectively.

Based on the above motivations and challenges, we propose an Image-Text Fusion model with a novel Herding Effect Attention mechanism (called ITF-HEA). In ITF-HEA, we generate users' preferences and summarize video contents through model-based CF. To analyze the influence of text information, image information, and contextual information separately, we split ITF-HEA into two models, the Text-based Model (TM) and the Image-Text Fusion model (ITF), and one attention mechanism, the Herding Effect Attention (HEA) mechanism. Specifically, in TM we sample and embed the TSCs to obtain sentence vectors (TSC features) with a bidirectional Long Short-Term Memory (LSTM) network, and then combine the TSC features with the hidden (embedding) features of the users and videos to predict the likeness of the user to the video. In ITF, we sample the corresponding video frame (image) features and fuse them with the TSC features to replace the single TSC features of TM. Finally, we design the HEA mechanism, based on the contextual semantic similarity and time intervals of TSCs, to incorporate contextual information into the TSC features used in TM and ITF.
The main contributions of this paper are as follows:
1) We propose a novel HEA mechanism that takes the context-dependent, real-time, and time-sensitive properties of TSCs into account to extract the textual features of TSCs more accurately and effectively.
2) We design an Image-Text Fusion model using model-based CF and combine it with the HEA mechanism to obtain ITF-HEA, which predicts the likeness of a user to a video more accurately and comprehensively.
3) We evaluate ITF-HEA on real-world datasets from mainstream video-sharing websites and compare it with state-of-the-art video recommendation methods. The results show that our model outperforms the baselines by 3.3% in precision and 3.78% in F1-score on average.

In this section, we discuss the related work in three aspects. Time-sync video comments were first introduced by Wu et al. [8]. Yang et al. [12] then summarized the features of TSCs, which inspires our work. Lv et al. [13] propose a video understanding framework that assigns temporal labels to highlighted video shots, and are the first to analyze TSCs with neural networks. Recently, Liao et al. [14] presented a larger-scale TSC dataset with a four-level structure and rich self-labeled attributes, which facilitates future research on TSCs. These works show that TSCs are a data source with great potential.

Video recommendation has attracted great attention from both industry and academia, and most state-of-the-art methods are based on CF. McAuley et al. [15] combine latent rating dimensions with latent review topics, a review-based method. Diao et al. [16] propose a probabilistic model based on CF and topic modeling, an LDA-based [17] method that captures the interest distribution of users and the content distribution of movies. He et al. [18] propose a scalable factorization model that incorporates visual signals into predictors of people's opinions, which is the state-of-the-art visual-based model. However, these recommendation methods are not designed for TSCs, as they ignore the interactive, real-time, and time-sensitive properties of TSC data.

Attention mechanisms have proven effective in natural language processing [19, 20, 21] and are increasingly used in recommendation systems to assign weights to user-item pairs. Chen et al. [22] introduce a novel attention mechanism into CF to address item- and component-level implicit feedback in multimedia recommendation, which can be seamlessly incorporated into classic CF models with implicit feedback. Seo et al. [23] model user preferences and item properties using convolutional neural networks (CNNs) with dual local and global attention. Our herding effect attention mechanism adopts soft attention [24], which learns attention weights according to their importance to the final task.

In this section, we describe two CF models and an attention mechanism. First, the problem formulation is given in Section 3.1. Then, we propose a Text-based Model that uses the textual features of TSCs in Section 3.2. Next, we design an Image-Text Fusion Model that jointly models video images and TSCs in Section 3.3. Finally, to take full account of the features of TSCs, we implement the Herding Effect Attention mechanism and give the complete neural network structure of the Image-Text Fusion Model with Herding Effect Attention in Section 3.4.
Suppose there are N TSCs, $TSC = \{tsc_1, tsc_2, \ldots, tsc_N\}$. For $tsc_i$, we define the corresponding visual feature as $vsl_i$ (see Section 3.3 for details) and its sentiment polarity $pol_i$, determined by the Stanford sentiment analysis toolkit (http://nlp.stanford.edu/sentiment). Besides, we define $u_i$ as the user ID and $v_i$ as the video ID of $tsc_i$. As mentioned in Section 1, TSCs are easily affected by previous comments. Therefore, for $tsc_i$, we continuously sample its M preorder TSCs $Context_i = \{pre_{i,1}, pre_{i,2}, \ldots, pre_{i,M}\}$ as context information ($pre_{i,M}$ is $tsc_i$ itself). For each $pre_{i,j} \in Context_i$, we define the timestamp $t_{i,j}$ as its posted video time, and represent $tsc_i$ and each context comment $pre_{i,j}$ by their word lists.

Intuitively, the preferences of users are extracted from the TSCs they publish in the corresponding videos, while the textual features of a video are summarized from the TSCs published in that video. Based on this, in this section we first extract the features of TSCs with a bidirectional LSTM, then merge the TSC features with the latent factors of users and videos, and finally predict the likeness of users to videos by CF. The general framework of the Text-based Model (TM) is shown in Fig. 2.

More concretely, to capture the word sequential information of TSCs, we use a Bidirectional Long Short-Term Memory (Bi-LSTM) network [25] to convert word features into TSC features. For each $tsc_i$, the forward and backward hidden states of the Bi-LSTM are concatenated into the sequence feature

$seq_i = \overrightarrow{h}_i \oplus \overleftarrow{h}_i$,

where $seq_i \in \mathbb{R}^d$ and $\oplus$ denotes vector concatenation. After the LSTM layer, we obtain the sequence feature $seq_i$ as the output. Then, we define $GU_{u_i}$ as the latent factor of user $u_i$, which encodes the user's historical preference, and likewise $GV_{v_i}$ as the latent factor of video $v_i$. Afterward, we design a merge function $\otimes$ to combine $GU_{u_i}$ and $GV_{v_i}$ with $seq_i$ respectively, obtaining

$p_i = GU_{u_i} \otimes seq_i$  (4)
$q_i = GV_{v_i} \otimes seq_i$  (5)

where $\otimes: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}^d$. In our framework, we treat the prediction of a user's favor for a video (or video clip) as a binary classification problem, where 1 means the user likes the video and 0 otherwise. Therefore, we define the likeness of user $u_i$ to video $v_i$ through $tsc_i$ in the training data as

$\hat{y}_i = \mathrm{sigmoid}(p_i \cdot q_i)$,  (6)

where $\mathrm{sigmoid}(x) = \frac{1}{1+e^{-x}}$ and $\cdot$ denotes the inner product.

Generally, users comment on their favorite videos with positive sentiment. Therefore, we determine the polarity of each TSC with the Stanford sentiment analysis toolkit [26]. For simplicity, we set the polarity of a TSC to 1 if the result is positive or neutral, and 0 otherwise. We define $y_i = pol_i$ as the ground truth of the likeness of user $u_i$ for video $v_i$ through $tsc_i$, where $pol_i$ is the polarity of $tsc_i$. Finally, we use binary cross-entropy as our loss function to model user preference; the objective function, which is maximized, is

$L = \sum_{i=1}^{N} \big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \big]$.  (7)

In the training phase, the parameters are learned via Adam [27]. After training, we use

$\hat{Po}_{u,v} = \frac{1}{|List_{u,v}|} \sum_{tsc_i \in List_{u,v}} \hat{y}_i$  (8)

to express the predicted likeness of user u for video v, and

$Po_{u,v} = \frac{1}{|List_{u,v}|} \sum_{tsc_i \in List_{u,v}} y_i$  (9)

to express the real likeness of user u for video v, where $List_{u,v}$ denotes the set of TSCs that user u has posted on video v. The ground truth for the testing data is then defined as

$y_{u,v} = \begin{cases} 0 & Po_{u,v} < 0.5 \\ 1 & Po_{u,v} \geq 0.5 \end{cases}$  (10)
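A minimal PyTorch-style sketch of the TM scoring path may help make this concrete. Only the Bi-LSTM encoding, the sigmoid over the inner product (Eq. (6)), the binary cross-entropy objective (Eq. (7)), and Adam follow the text; the element-wise product used for the merge function ⊗, the last-step pooling of the Bi-LSTM output, and all layer and variable names are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of the Text-based Model (TM); hyper-parameters and the choice of
# element-wise product for the merge function "⊗" are assumptions.
import torch
import torch.nn as nn

class TextModel(nn.Module):
    def __init__(self, vocab_size, n_users, n_videos, d=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        # Bi-LSTM whose concatenated directions give a d-dimensional seq_i.
        self.lstm = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)
        self.user_factors = nn.Embedding(n_users, d)    # GU
        self.video_factors = nn.Embedding(n_videos, d)  # GV

    def forward(self, words, user_ids, video_ids):
        # words: (batch, seq_len) word indices of each TSC
        h, _ = self.lstm(self.embed(words))        # (batch, seq_len, d)
        seq = h[:, -1, :]                          # sequence feature seq_i
        p = self.user_factors(user_ids) * seq      # p_i = GU ⊗ seq_i (assumed element-wise)
        q = self.video_factors(video_ids) * seq    # q_i = GV ⊗ seq_i
        return torch.sigmoid((p * q).sum(dim=-1))  # Eq. (6): sigmoid of inner product

# Training setup as stated in the paper: binary cross-entropy and Adam.
model = TextModel(vocab_size=50000, n_users=100, n_videos=871)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss()
```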
In a time-sync video, each TSC has a timestamp recording the video time at which it was published, so we can easily obtain the corresponding image information for better feature extraction. In this section, we focus on merging TSC text features with the corresponding visual features to obtain more comprehensive features. The general framework of the Image-Text Fusion model (ITF) is shown in Fig. 3.

For $tsc_i$, we use $vsl_i$ to denote the visual feature of the frame at which $tsc_i$ is posted. The visual features are the 4096-way outputs provided in the public TSC data set of Chen et al. [3], extracted with the Caffe reference model (5 convolutional layers followed by 3 fully-connected layers) pre-trained on 1.2 million ImageNet (ILSVRC2010) images. Since the dimension of $vsl_i$ is 4096, we reduce it to d and obtain

$vsl'_i = \mathrm{Dense}(vsl_i)$,

where Dense is a fully-connected layer with the ELU activation function [28]. To combine the image features and textual features, we first concatenate the sequence feature $seq_i$ with the visual feature $vsl'_i$ and obtain the 2d-dimensional vector

$com_i = seq_i \oplus vsl'_i$.

Then, we reduce the dimension of $com_i$ to d and get

$com'_i = \mathrm{Dense}(com_i)$.

Finally, we use $com'_i$ instead of $seq_i$ to merge with $GU_{u_i}$ and $GV_{v_i}$ by Eq. (4) and Eq. (5) and predict the likeness by Eq. (6).

Existing review-based recommendation methods usually handle each comment separately, without considering the contextual associations between comments. However, TSCs are highly semantically and temporally related, which is the so-called herding effect: a TSC may be affected by preceding TSCs on a similar topic, and TSCs with similar semantics and short timestamp intervals are more likely to influence each other. Based on the above, we design the HEA mechanism, which calculates the influence weights of a TSC's context from semantic similarities and timestamp intervals within an LSTM-based encoder-decoder framework. The framework of HEA is shown in Fig. 4.

We formalize HEA as an encoder-decoder framework. Given the context features $SEQ_i$ as input, the encoder LSTM produces the hidden state vectors $H = (h_1, h_2, \ldots, h_M)$. To calculate the influence weights of the context of $tsc_i$, for each $pre_{i,j}$ we define the semantic similarity vector $SIM_j = (sim(j,1), sim(j,2), \ldots, sim(j,M))$ and the time delay vector $TD_j = (delay(j,1), delay(j,2), \ldots, delay(j,M))$, where

$sim(j,k) = \frac{seq_{i,j} \cdot seq_{i,k}}{|seq_{i,j}|\,|seq_{i,k}|}$  (15)

is the cosine similarity between $pre_{i,j}$ and $pre_{i,k}$, and $delay(j,k)$ models the influence of $pre_{i,k}$ on $pre_{i,j}$, which decreases as the time interval $|t_{i,j} - t_{i,k}|$ increases (β is a hyper-parameter that controls the decay, discussed in Section 4). Since the semantic similarities may be negative, we first normalize $SIM_j$ with the softmax function. Next, we combine the normalized similarities with $TD_j$ to calculate the attention score vector of $pre_{i,j}$, and the final attention distribution $A_j$ is obtained by normalizing this score vector with the softmax function. The decoder input is computed from the attention-weighted encoder hidden states, and the decoder produces its output. Finally, we use the decoder output $h_M$ instead of the $seq_i$ used in Sections 3.2 and 3.3 as the textual feature.
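The following sketch illustrates how the HEA attention weights could be computed for one TSC and its M context comments. The cosine similarity (Eq. (15)) and the two softmax normalizations follow the text; the exponential time decay exp(-β·Δt) and the element-wise combination of similarity and decay are illustrative assumptions, since the paper's exact formulas for the time-delay term and the score combination are given only in its figures.

```python
# Illustrative sketch of the Herding Effect Attention weights for one TSC.
# The exponential decay and the element-wise product are assumptions; the
# cosine similarity and the softmax normalizations are stated in the text.
import torch
import torch.nn.functional as F

def hea_weights(seq, t, j, beta=0.2):
    """seq: (M, d) context TSC features; t: (M,) timestamps; j: index of the
    target comment pre_{i,j}. Returns an (M,) attention distribution A_j."""
    # Eq. (15): cosine similarity between pre_{i,j} and every context comment.
    sim = F.cosine_similarity(seq[j].unsqueeze(0), seq, dim=-1)   # SIM_j
    sim = F.softmax(sim, dim=0)                                   # normalize SIM_j
    # Assumed time decay: influence shrinks as the timestamp gap grows.
    decay = torch.exp(-beta * (t[j] - t).abs())                   # TD_j
    score = sim * decay                                           # attention score vector
    return F.softmax(score, dim=0)                                # final distribution A_j

# Example: M = 10 context comments with d = 128 features.
seq = torch.randn(10, 128)
t = torch.arange(10, dtype=torch.float)   # posting times in seconds
attn = hea_weights(seq, t, j=9)           # weights used to mix encoder states h_1..h_M
```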
We integrate the context-dependent, real-time, and time-sensitive properties of TSCs into the model through the HEA mechanism, which can be applied in both TM and ITF to improve TSC feature extraction. The complete network structure of ITF-HEA is shown in Fig. 5.

In this section, we demonstrate the effectiveness of the proposed method by comparing it with four well-known video recommendation methods. We first give the necessary parameters of our model, then analyze its performance on time-sync video recommendation, and finally analyze the effect of the hyper-parameters on the experimental results.

The data used in this paper are crawled from the Chinese time-sync video site Bilibili by Chen et al. [3], covering the movie category up to December 10th, 2015. In this paper, we select the 100 users who have posted the most TSCs and have commented on more than 40 videos. These users have commented on a total of 871 videos, and we take all the comments in those videos as a sub-dataset, in which 423,384 users have published 1,319,475 TSCs in total. For each of the 100 users, we select half of the videos they have commented on as the training set and the other half as the test set, making sure that at least 20 videos per user can be recommended (to ensure the validity of Top 20). In the test set, we obtain 2,995 (user, video) pairs, of which 1,972 are positive and 1,023 are negative in sentiment polarity. In the training set, we obtain 2,811 (user, video) pairs with 11,775 TSCs (a user may post more than one TSC in a video), of which 8,124 TSCs are positive and 3,651 are negative.

In our model, the hyper-parameter β and the number of contextual TSCs M need to be chosen. We select 35% of the test set (1,075 (user, video) pairs) as the validation set to tune β. The initial learning rate of Adam [27] is 0.001 and the vector dimension d is set to 128. We obtain the best results with β = 0.2 and M = 10, as discussed in Section 4.2.

In this section, we use the test set described in Section 4.1 to compare our complete model with existing methods. To evaluate the performance of the proposed models, we compare them with the following baselines:
• HFT: a state-of-the-art method for rating prediction with textual reviews [15]. In the experiments, we set the rating of positively commented videos to 1, and 0 otherwise.
• JMARS: a Latent Dirichlet Allocation (LDA) based method for rating prediction with textual reviews [16].
• VBPR: a visual-based recommendation method [18].
• KFRCI: a key-frame recommender that models user TSCs and key-frame images simultaneously [3]. In the experiments, the likeness score of a video is the average score over all the frames the user has commented on.
• ITF-HEA: the Image-Text Fusion Model proposed in Section 3.3 with the Herding Effect Attention mechanism proposed in Section 3.4.

For each baseline, we select the best experimental parameters within the ranges given in the corresponding papers and calculate the likeness/rating between users and videos. Our experiments predict each user's Top 5, 10, and 20 favorite videos, where Top X denotes the X videos in the test set with the highest predicted likeness computed by Eq. (8); we recommend all X videos to the user and treat them as the user's favorite videos. We adopt F1-score and precision to evaluate the baselines and our models. All models are run 10 times, and we report the average values as the final results.
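As a concrete illustration of this evaluation protocol, the sketch below ranks videos per user by predicted likeness (Eq. (8)), recommends the top X, and scores the lists against the ground-truth labels of Eq. (10). The micro-averaged precision/recall computation, variable names, and dictionary-based data layout are assumptions for the example, not the authors' exact evaluation code.

```python
# Sketch of Top-X evaluation: rank videos per user by predicted likeness,
# recommend the top X, and compute precision and F1 against y_{u,v}.
from typing import Dict, Tuple

def evaluate_top_x(pred: Dict[Tuple[int, int], float],
                   truth: Dict[Tuple[int, int], int],
                   x: int = 10) -> Tuple[float, float]:
    users = {u for (u, _) in pred}
    hits = recommended = relevant = 0
    for u in users:
        scores = [(v, s) for (uu, v), s in pred.items() if uu == u]
        top = sorted(scores, key=lambda p: p[1], reverse=True)[:x]
        liked = {v for (uu, v), y in truth.items() if uu == u and y == 1}
        hits += sum(1 for v, _ in top if v in liked)
        recommended += len(top)
        relevant += len(liked)
    precision = hits / recommended if recommended else 0.0
    recall = hits / relevant if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, f1

# Tiny example with one user and two (user, video) pairs.
pred = {(0, 1): 0.9, (0, 2): 0.3}
truth = {(0, 1): 1, (0, 2): 0}
print(evaluate_top_x(pred, truth, x=1))  # -> (1.0, 1.0)
```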
The results of F1-score and precision are shown in Table 1, from which we can see that ITF-HEA achieves the best performance on both F1-score and precision (F1 is proportional to precision for Top 5, 10, and 20). Compared with KFRCI, the best-performing baseline, it improves F1-score by about 2.30%, 3.91%, and 5.14% (3.78% on average) and precision by 2.20%, 3.50%, and 4.20% (3.30% on average) on Top 5, 10, and 20, respectively. Among the other baselines, the vision-based method VBPR performs better than the rest; the text-based methods HFT and JMARS perform similarly, while the PMF method performs worst.

Next, to analyze the effects of text features, image features, and the attention mechanism in our model, we compare the models proposed in Section 3:
• TM: the Text-based Model proposed in Section 3.2.
• T-HEA: the Text-based Model proposed in Section 3.2 with the Herding Effect Attention mechanism proposed in Section 3.4.
• ITF: the Image-Text Fusion Model proposed in Section 3.3.
• ITF-HEA: the Image-Text Fusion Model proposed in Section 3.3 with the Herding Effect Attention mechanism proposed in Section 3.4.

The results of F1-score and precision are shown in Table 2. They show that although T-HEA uses only textual information, it still outperforms ITF, and even outperforms the state-of-the-art method KFRCI, which indicates that our HEA mechanism effectively models the influence of the herding effect and improves the performance of the model. The results also show that the context and timestamps of TSCs carry vital information and need to be considered.

Finally, we discuss the influence of the hyper-parameter β and the number of contextual TSCs M on the experimental results. We first fix M = 10, vary β from 0 to 0.5 in steps of 0.1, and calculate the F1-score of the Top 5, 10, and 20 favorite videos on the validation set; the best results are obtained at β = 0.2. We also calculate the F1-score for the different hyper-parameters on the test set, with the results shown in Fig. 6. In every case β = 0.2 gives the best performance, consistent with the validation set. A larger β degrades the results because it weakens the weights of other TSCs in the attention layer, and β = 0 performs worst because the time information is then ignored entirely. For the context length M, we fix β = 0.2 and set M to 5, 10, 15, and 20. The results are shown in Fig. 7. ITF-HEA performs best at M = 10 and worst at M = 20, which confirms that the herding effect of TSCs is time-sensitive and does not last long, matching our observation in Section 1.

In this paper, we proposed a novel personalized online video recommendation method that uses both TSCs and their corresponding images through model-based CF. To extract the textual features of TSCs more accurately and effectively, we designed the HEA mechanism, which assigns an influence weight to each TSC based on semantic similarity and time interval. In this way, we integrated the context-dependent, real-time, and time-sensitive properties of TSCs into the neural network framework and predicted users' preferences for online videos accurately and effectively. Extensive experiments on a real-world dataset show that, with the HEA mechanism, our model recommends videos more precisely than the state-of-the-art method. This is a first step towards our goal in personalized video recommendation, and there is much room for further improvement: designing a more accurate fusion model to capture comprehensive user preferences remains challenging, as does measuring the weight of each video in a user's overall preference.
References:
[1] When recurrent neural networks meet the neighborhood for session-based recommendation
[2] Exploring the use of time-dependent cross-network information for personalized recommendations
[3] Personalized key frame recommendation
[4] Aspect based recommendations: Recommending items with the most valuable aspects based on user reviews
[5] A unified personalized video recommendation via dynamic recurrent neural networks
[6] Contextual video recommendation by multimodal relevance and user feedback
[7] Multimodal neural language models
[8] Crowdsourced time-sync video tagging using temporal and personalized topic modeling
[9] Predicting the popularity of danmu-enabled videos: A multifactor view
[10] Tradeoff between distributed social learning and herding effect in online rating systems: Evidence from a real-world intervention
[11] Video recommendation using crowdsourced time-sync comments
[12] Crowdsourced time-sync video tagging using semantic association graph
[13] Reading the videos: Temporal labeling for crowdsourced time-sync videos based on semantic embedding
[14] TSCset: A crowdsourced time-sync comment dataset for exploration of user experience improvement
[15] Hidden factors and hidden topics: Understanding rating dimensions with review text
[16] Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS)
[17] Latent Dirichlet allocation
[18] VBPR: Visual Bayesian personalized ranking from implicit feedback
[19] Attention-fused deep matching network for natural language inference
[20] Translating embeddings for knowledge graph completion with relation attention mechanism
[21] Neural relation extraction via inner-sentence noise reduction and transfer learning
[22] Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention
[23] Interpretable convolutional neural networks with dual local and global attention for review rating prediction
[24] Attention is all you need
[25] Bidirectional recurrent neural networks
[26] The Stanford CoreNLP natural language processing toolkit
[27] Adam: A method for stochastic optimization
[28] Fast and accurate deep network learning by exponential linear units (ELUs)