Response to LiveBot: Generating Live Video Comments Based on Visual and Textual Contexts
Hao Wu, Gareth J. F. Jones, Francois Pitie
2020-06-04

Live video commenting systems are an emerging feature of online video sites. Recently, the Chinese video sharing platform Bilibili has popularised a novel captioning system in which user comments are displayed as streams of moving subtitles overlaid on the video playback screen and broadcast to all viewers in real time. LiveBot was recently introduced as a novel Automatic Live Video Commenting (ALVC) application, which enables the automatic generation of live video comments from both the video stream and the existing viewer comments. In seeking to reproduce the baseline results reported in the original LiveBot paper, we found differences between the results reproduced using the project codebase and the numbers reported in the paper. Further examination suggests that this may be caused by a number of small issues in the project code, including a non-obvious overlap between the training and test sets. In this paper, we study these discrepancies in detail and propose an alternative baseline implementation as a reference for other researchers in this field.

Live commenting mechanisms have become a core feature of video platforms such as Bilibili, one of the most popular video sharing platforms in China with more than 300 million monthly active users¹. This feature increases user interaction by providing a real-time commentary subtitle system that displays user comments as streams of moving subtitles overlaid on the video playback screen, visually resembling a danmaku shooter game. These comments are simultaneously broadcast to all viewers in real time. They were originally called "danmaku" on the Japanese platform Nicovideo and then "弹幕" on the Chinese platform Bilibili. In the rest of this document we refer to them by their Pinyin (a romanization system for Chinese characters) name: "danmu". Figure 1 shows an example of a video from Bilibili with a few danmu overlaid. The danmu system differs from the commenting or online streaming systems of most video sharing platforms in that it provides a chat-room experience in which users can watch and discuss together. Several efforts have been made to investigate this new type of media content [1, 5, 6].

The creation of new danmu comments to enrich videos has the potential to improve the viewing experience and to help attract more viewers. Shuming Ma et al. [7] proposed "LiveBot", which uses a unified transformer architecture to automatically generate new danmu comments given existing danmu comments and video frames. LiveBot uses AI agents to comprehend the videos and to interact with human viewers who also make comments. In that work, a large-scale live comment dataset with 2,361 videos and 895,929 live comments was constructed. In an attempt to replicate the proposed method [7], using this dataset which was provided to the research community by the authors, we found that our results were much lower than the reported baselines. To understand this issue we carefully reviewed the implementation and dataset provided by the authors on their project webpage, and found a number of potential issues which may explain this discrepancy. We examine these problems one by one and analyse their impacts.
Finally, we propose a new baseline implementation which could serve as an independent reference.

In this section we briefly introduce the background of the ALVC task. We refer readers to the original paper [7] for more detailed information. The live commenting dataset built for the Automatic Live Video Commenting (ALVC) task was collected from Bilibili and contains 2,361 videos and 895,929 comments. Each video comment is associated with a time tag which indicates where in the video the comment should appear. The processed dataset partition, the raw dataset and the code are available on the GitHub page [4]. In this paper we use this dataset for our experiments; Table 1 shows the detailed statistics of the dataset.

The ALVC task is defined as follows: given a video V, a time-stamp t and the surrounding comments C near the time-stamp, the commenting system should generate a comment y relevant to the video clips and/or to the other comments near the time-stamp. Specifically, the model takes the nearest m frames I = I_1, I_2, ..., I_m and the nearest n comments C = C_1, C_2, ..., C_n to the time-stamp t as input, and aims to generate a comment y = y_1, y_2, ..., y_k.

For our investigation we follow the model structure described in [7] and illustrated in Figure 2 (see the section "Model II: Unified Transformer Model" in [7]). Comments and video frames are encoded using a Transformer architecture [8]. The model consists of three parts: the video encoder, which encodes the video frames into a visual representation; the text encoder, which produces a contextual vector by encoding the sequence of input words combined with the visual representation; and the comment decoder, which combines these vectors to generate the output tokens recursively.

Retrieval-based evaluation metrics are used in the reported experiments to automatically evaluate ALVC: a candidate comment set is constructed for each test sample, and the model is asked to sort this candidate set; the authors assume that a good model is able to rank the correct comments at the top of the proposed set. The candidate set contains four types of comments:
• Correct: 5 ground-truth comments from humans.
• Plausible: the 30 comments most similar to the title of the video based on tf-idf score.
• Popular: 20 of the most popular comments in the training set.
• Random: random comments taken from the training set, added to ensure there are 100 unique comments in the candidate set.

The following retrieval metrics are used to evaluate the results:
• Recall@k: the proportion of human comments found in the top-k recommendations;
• Mean Rank (MR): the mean rank of the human comments;
• Mean Reciprocal Rank (MRR): the mean reciprocal rank of the human comments.
Results for all these metrics are presented in Table 3. We also report the confidence interval for each of these metrics. For Recall@k we use the confidence interval for population proportions at a 95% confidence level, and for MR and MRR we use the confidence interval for the mean at the same confidence level.

We first tried to reproduce the work of [7] using the released code and dataset. For reference, the LiveBot results are reported in Table 3 and labeled "Livebot paper". Results with different inputs are reported separately (e.g. "Text Only" means that the text input is entirely masked during the test stage). We conducted our experiments using the code provided on the authors' GitHub project page, and used the same model structure and configuration (batch size, learning rate, etc.) described in [7]. The results we obtained are shown in the same table with the label "Issue #1". Clearly, the results from our experiments are much lower than the baselines.
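To make the evaluation protocol above concrete, the following is a minimal sketch of how the retrieval metrics and the 95% confidence interval for Recall@k can be computed from per-candidate model losses. This is our own illustration rather than code from the Livebot repository; the data layout (a list of candidate scores plus the indices of the human comments for each test sample) and the normal-approximation interval are assumptions made for illustration.

```python
# Minimal sketch (our own illustration): retrieval metrics for the ALVC evaluation.
# Each test sample supplies per-candidate model losses (lower is better) and the
# indices of the human ground-truth comments within the 100-candidate set.
import math

def evaluate(samples, ks=(1, 5, 10)):
    """samples: iterable of (scores, correct_indices) pairs, one per test instance."""
    all_ranks = []
    for scores, correct in samples:
        order = sorted(range(len(scores)), key=lambda i: scores[i])      # ascending loss
        position = {cand: rank + 1 for rank, cand in enumerate(order)}   # 1-based ranks
        all_ranks.extend(position[c] for c in correct)
    n = len(all_ranks)
    recall = {k: sum(r <= k for r in all_ranks) / n for k in ks}
    mean_rank = sum(all_ranks) / n
    mrr = sum(1.0 / r for r in all_ranks) / n
    # 95% confidence interval for Recall@k treated as a population proportion
    # (normal approximation): p +/- 1.96 * sqrt(p * (1 - p) / n)
    ci = {k: 1.96 * math.sqrt(recall[k] * (1 - recall[k]) / n) for k in ks}
    return recall, ci, mean_rank, mrr
```

A model that scores candidates by cross-entropy loss simply passes its per-candidate losses as the scores, so that lower values rank higher.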
In order to explore the reasons for the performance mismatch, we conducted a series of investigations examining the GitHub implementation and the released dataset. From this investigation we identified a number of issues with the GitHub implementation, which are presented below.

3.1 Issue #1: Candidate Set Ranking.
First, in the implementation the re-ranked candidate list is sorted by cross-entropy loss in descending order. However, according to the paper, a good candidate should be placed at the top of the candidate list, in which case the cross-entropy loss should be sorted in ascending order. This issue was also raised on the GitHub issue page by another researcher²; the corresponding results are labeled "GitHub Issue" in Table 3. We report the results with this issue fixed (see "Issue #1-2"). The scores are very close to the results from the GitHub issue page. We can see that after fixing the ranking problem the scores improve a little, but they are still significantly lower than the reported LiveBot baselines.

We then looked carefully at the evaluation code and noticed a subtle error in the computation of the candidate scores: in the original implementation the score of a candidate is computed as the sum of the cross-entropy loss over all tokens rather than as the mean value. This gives an advantage to short candidates and, in fact, we found that the top re-ranked positions in the list are mostly occupied by comments of only one word. We fixed the code by averaging the score over every non-ignored token (padding and separator tokens are ignored when computing the cross-entropy loss). Thus, instead of

$\mathrm{score} = \sum_{i=1}^{L} \mathrm{CE}(g_i, h_i)$

we implemented

$\mathrm{score} = \frac{1}{\#\mathrm{Valids}} \sum_{i=1}^{L} \mathrm{CE}(g_i, h_i)$

where $g_i$ and $h_i$ are the $i$-th output token and ground-truth token, $\mathrm{CE}$ denotes the token-level cross-entropy loss, $L$ is the maximum length of the model output (including padding), and $\#\mathrm{Valids}$ is the number of valid tokens in a candidate. The results are reported as "Issue #3" in Table 3; at this step we obtain scores that are closer to the baselines.

We also found an inconsistency in the construction of the plausible set. The paper describes the plausible set as being retrieved based on the video title when building the candidate list. However, in the implementation we noticed that the plausible set is retrieved using the current context comments (the comments surrounding the ground-truth comment, which are also the text input) as the query, rather than the video title. Unfortunately, the mapping between the raw dataset and the provided processed dataset is not given, so we are not able to reconstruct the provided dataset from the raw dataset, and hence could not directly compare the results with and without fixing this issue.

We also carefully examined the released dataset; specifically, we checked the overlap of comments between the training and test sets of the provided processed dataset. We found that 5,436 of the 17,771 comments in the test set also appear in the training set. Although some popular comments can be expected to appear in different videos, after manually checking the provided dataset we found a number of identical videos, assigned different video ids, that appear in both the training and test sets. Table 4 lists several examples of this situation. In the raw dataset we use the video title to uniquely identify a video, and found that 38 videos appear more than once in the raw dataset.
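As an illustration of how these two checks can be carried out, the sketch below counts the test comments that also occur in the training split and the video titles that occur more than once in the raw dataset. The file names and the one-JSON-object-per-line layout with "comment" and "title" fields are assumptions made for illustration, not the actual format of the released data.

```python
# Minimal sketch (our own illustration) of the two dataset checks described above.
import json
from collections import Counter

def load_field(path, field):
    """Assume one JSON object per line containing the requested field."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)[field] for line in f]

# Comment overlap between the processed training and test splits.
train_comments = set(load_field("train.json", "comment"))
test_comments = load_field("test.json", "comment")
overlap = [c for c in test_comments if c in train_comments]
print(f"{len(overlap)} of {len(test_comments)} test comments also occur in the training set")

# Videos whose title appears more than once in the raw dataset.
title_counts = Counter(load_field("raw_videos.json", "title"))
duplicates = [t for t, n in title_counts.items() if n > 1]
print(f"{len(duplicates)} video titles appear more than once in the raw dataset")
```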
To address this issue, we decided to rebuild the dataset from the raw dataset rather than directly updating the processed dataset, because of the missing video mapping between the raw and processed datasets. After removing redundant videos from the raw dataset we end up with 2,322 unique videos. Following the LiveBot paper, we split these into training / development / test sets of 2,122 / 100 / 100 videos and conducted experiments with all of the above issues fixed (statistics of the dataset are summarised in Table 2). This dataset is labeled "No duplicate" in the result table. Our results after removing the duplicate videos are shown as "Issue #1-4" in Table 3. Compared to "Issue #1-3" the performance is slightly lower, which is what we would anticipate, since the model no longer gains from the information overlap between the training and test sets.

Table 4: Several comments that appear in both the training and test sets of the provided dataset (English translations in quotes).
• 像我这么瘦的可能效果不会太明显,各种无器械动作交杂着做两个多月才有了明显的变化,还不是很大 ("It might not be obvious for skinny people like me, there are only minor changes after 2 months of exercise.")
• "Can not tell anything from this, muscle growth is about resting and recovering rather than working out every day."
• "Doing 100 requires endurance not strength, compound push-up is the best."
• 每天100个俯卧撑100个仰卧起坐跑步10公里坚持3年然后再把头发剃光滑稽 ("100 push-ups, 100 sit-ups, 100 squats and a 10 km run every single day for 3 years, then shave your hair lol.")
• 练肌肉最费钱,想练快就每天吃低脂牛肉,配合锻炼,半年就有显著变化 ("Muscle gain is expensive; regular exercise with low-fat beef and you will see the changes in half a year.")

In order to provide a reproducible implementation for later research on the ALVC task, we re-implemented the transformer network of LiveBot using the OpenNMT [3] open-source neural machine translation framework. We followed the model structure shown in Figure 2, and used the newly constructed dataset described in section 3.4, with all duplicate videos removed. The vocabulary size is set to 30,000 to keep it consistent with the original paper, and in the transformer network the size of the word embedding and hidden layers is set to 512, as in [7]. Additionally, the batch size is set to 64 and the dropout rate to 0.2. The optimizer is Adam [2], with β₁ = 0.9 and β₂ = 0.998. Results of this re-implementation, with the previous issues resolved, are reported in Table 3 under the label "Re-implementation". At this stage the scores we obtain are very close to the "Issue #1-4" run, so we believe that the implementation and the scores it generates are valid and could serve as a new baseline for this task. The code and the dataset used to generate the above results are available on GitHub³.
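For completeness, the sketch below illustrates the corrected candidate scoring discussed above: the cross-entropy loss is averaged over valid (non-padding) tokens and candidates are ranked in ascending order of this averaged score. It is a PyTorch-style illustration under assumed tensor shapes and padding index, not code taken from either the original or the re-implemented codebase.

```python
# Minimal sketch (our own illustration) of the corrected candidate scoring.
import torch
import torch.nn.functional as F

PAD_ID = 0  # assumed padding index

def candidate_scores(logits, targets):
    """logits: (num_candidates, max_len, vocab_size); targets: (num_candidates, max_len)."""
    ce = F.cross_entropy(
        logits.transpose(1, 2),   # (num_candidates, vocab_size, max_len)
        targets,
        ignore_index=PAD_ID,
        reduction="none",         # keep one loss value per token
    )                             # -> (num_candidates, max_len)
    valid = (targets != PAD_ID).float()
    # mean loss over valid tokens only, i.e. sum(loss) / #Valids for each candidate
    return (ce * valid).sum(dim=1) / valid.sum(dim=1).clamp(min=1)

def rank_candidates(logits, targets):
    """Lower averaged loss = better candidate; sort in ascending order."""
    return torch.argsort(candidate_scores(logits, targets))
```

Averaging over the valid tokens removes the length bias that previously favoured one-word comments, and sorting the averaged losses in ascending order places the most plausible candidates at the top of the list.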
In this paper we reviewed the code presented as the official LiveBot implementation and found a number of discrepancies with the original paper. We have addressed each of these issues and reported updated results accordingly. We also propose a new baseline implementation using the OpenNMT framework. The updated baseline results are still lower than the ones reported in the original LiveBot paper. However, since we do not have access to the exact version of the code used to produce those original results, we are not able to determine the exact reason for these differences; based on our experiments and our analysis, we believe this performance gap is caused by the removal of the duplicate videos.

REFERENCES
[1] Stories That Big Danmaku Data Can Tell as a New Media.
[2] Adam: A method for stochastic optimization.
[3] OpenNMT: Open-Source Toolkit for Neural Machine Translation.
[4] LiveBot project GitHub page (code and dataset).
[5] Visual-Texual Emotion Analysis with Deep Coupled Video and Danmu Neural Networks.
[6] Gossiping the videos: An embedding-based generative adversarial framework for time-sync comments generation.
[7] Livebot: Generating live video comments based on visual and textual contexts.
[8] Attention is all you need.

ACKNOWLEDGEMENT
This work was supported by Science Foundation Ireland as part of the ADAPT Centre (Grant 13/RC/2106) (www.adaptcentre.ie) at Trinity College Dublin.