title: Mutual Distillation Learning Network for Trajectory-User Linking
authors: Chen, Wei; Li, Shuzhe; Huang, Chao; Yu, Yanwei; Jiang, Yongguo; Dong, Junyu
date: 2022-05-08

Trajectory-User Linking (TUL), which links trajectories to the users who generate them, has been a challenging problem due to the sparsity of check-in mobility data. Existing methods ignore either historical data or the rich contextual features in check-in data, resulting in poor performance on the TUL task. In this paper, we propose a novel mutual distillation learning network, named MainTUL, to solve the TUL problem for sparse check-in mobility data. Specifically, MainTUL is composed of a Recurrent Neural Network (RNN) trajectory encoder that models the sequential patterns of the input trajectory and a temporal-aware Transformer trajectory encoder that captures long-term time dependencies of the corresponding augmented historical trajectories. The knowledge learned on historical trajectories is then transferred between the two trajectory encoders to guide the learning of both encoders and achieve mutual distillation of information. Experimental results on two real-world check-in mobility datasets demonstrate the superiority of MainTUL against state-of-the-art baselines. The source code of our model is available at https://github.com/Onedean/MainTUL.

The rapid development of Location-Based Social Network (LBSN) platforms has made it easier for humans to digitize their mobility behaviors by sharing their check-ins, opinions, and comments [Lian et al., 2020]. These mobility behaviors can be used to understand and predict human movement patterns, facilitating intelligent business models and better user experience. However, individual mobility is not always predictable due to missing records and the sparsity of check-in data [Lian et al., 2014]. Trajectory-user linking (TUL) [Gao et al., 2017] was recently proposed as a task to identify user identities based on personal mobility trajectories. It plays an important role in revealing basic human movement patterns by mining user mobility behaviors. In addition, TUL could benefit a broad range of applications in business, transportation, epidemic prevention, and public safety, such as location-based services [Liu et al., 2019], tracking the COVID-19 pandemic [Hao et al., 2020], intelligent transportation, and identifying terrorists/criminals for public safety [Huang et al., 2018]. In this work, we are interested in linking trajectories to their potential users for check-in mobility data.

TUL can essentially be seen as an extension of traditional trajectory classification tasks. Traditional trajectory similarity measures such as Longest Common Sub-Sequence (LCSS) and Dynamic Time Warping (DTW) can identify the most likely users by measuring the similarity between unknown trajectories and known ones. Recently, a handful of studies [Zhou et al., 2018, 2021a; Miao et al., 2020] have been developed to solve the TUL problem through deep trajectory representation learning. However, the existing methods still have three key limitations. First, all existing approaches still suffer from data sparsity and perform poorly on sparse check-in mobility datasets. Second, existing methods focus only on spatial and/or temporal features, ignoring the rich contextual features in check-in data such as POI categories.
Third, most existing methods neglect historical data. Due to the inherent sparsity of check-in data, the historical data of the same user encodes more complex movement patterns, which may help improve model performance. Nevertheless, how to effectively utilize the knowledge in historical data remains a significant challenge.

To address the aforementioned challenges, we propose MainTUL, a mutual distillation learning network, to solve the TUL problem for sparse check-in trajectory data. In MainTUL, we design two different trajectory encoders: an RNN-based encoder that learns the spatio-temporal movement patterns of the input trajectory, and a temporal-aware Transformer encoder that captures long-term time dependencies of the corresponding augmented trajectory. The knowledge learned from the augmented trajectory data is then transferred to guide the learning of the RNN-based trajectory encoder. Meanwhile, the input trajectory and the augmented trajectory are exchanged to realize a mutual distillation learning network. Additionally, we design a check-in embedding layer that produces multi-semantic check-in representations integrating POI category and time information, which are then fed into the mutual distillation learning network. Experimental results on two real-life human mobility datasets show that our model significantly outperforms state-of-the-art baselines (14.95% Acc@1 gain and 14.11% Macro-F1 gain on average) on the TUL task. Our contributions can be summarized as follows:

• We propose a mutual distillation learning network model, MainTUL, to solve the TUL problem for sparse check-in trajectory data. MainTUL effectively leverages historical data through trajectory augmentation and knowledge distillation to improve model performance.

• We design a temporal-aware Transformer trajectory encoder in MainTUL to capture the long-term time dependencies in the augmented trajectories. With the designed trajectory encoders, our model achieves mutual distillation of information.

• We conduct extensive experiments on two real-life check-in mobility datasets. Results show that our model significantly outperforms state-of-the-art baselines by 14.95% and 14.11% on average in terms of Acc@1 and Macro-F1.

Trajectory similarity measures have been widely used to explore the similarity of users from their spatio-temporal trajectories. Examples include LCSS [Ying et al., 2010], DTW [Keogh and Pazzani, 2000], the Spatio-Temporal Linear Combine distance [Shang et al., 2017], and the Spatiotemporal Signature [Jin et al., 2019]. However, such measures only consider spatial or spatio-temporal proximity and cannot capture the temporal dependencies in trajectory data. Recently, a variety of studies focusing on deep representation learning have been proposed for trajectory similarity computation [Li et al., 2018; Yao et al., 2019; Yang et al., 2021]. However, these studies focus more on improving the efficiency of trajectory similarity computation.

The introduction of the TUL problem [Gao et al., 2017; Zhou et al., 2018] has further promoted the progress of deep trajectory representation learning in spatio-temporal data mining. Several methods [Miao et al., 2020; Zhou et al., 2021b] have been proposed to solve the TUL problem with deep neural networks. TULVAE [Zhou et al., 2018] incorporates a variational autoencoder (VAE) into the TUL problem to learn hierarchical semantics of check-in trajectories with RNNs. DeepTUL [Miao et al., 2020] proposes a recurrent network with an attention mechanism that models higher-order and multi-periodic mobility patterns by learning from historical data to alleviate the data sparsity problem. AdattTUL and TGAN [Zhou et al., 2021b] introduce Generative Adversarial Networks (GANs) to deal with the TUL problem. Recently, SML-TUL [Zhou et al., 2021a] uses contrastive learning to learn predictive representations from user mobility itself, constrained by spatio-temporal factors. Nevertheless, these methods rely on RNNs for modeling or prediction, which cannot effectively capture long-term time dependencies, and they all ignore the effective use of historical data.

Hinton et al. [2015] first proposed the concept of Knowledge Distillation (KD) in a teacher-student architecture, which seeks to provide another pathway to gain knowledge about a task by training a model with a distillation loss in addition to the task loss. Ruffy and Chahal [2019] validate that appropriately tuned classical distillation, combined with a data augmentation training scheme, provides orthogonal improvements. Recently, Zhao et al. [2021] proposed a trainable mutual distillation learning model, which improves end-to-end performance more effectively than the traditional teacher-student framework. In this work, we make the first attempt to use a mutual distillation strategy to effectively utilize knowledge extracted from historical data to improve TUL performance.
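As a concrete reference point, the sketch below shows the classical distillation objective described above in PyTorch: a hard-label task loss plus a KL term on temperature-softened logits, with the usual T² scaling. The function name and default values are illustrative, not taken from any particular implementation.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, lam=1.0):
    """Classical KD: hard-label task loss + temperature-softened KL term.
    T and lam are illustrative defaults, not values from any paper."""
    task = F.cross_entropy(student_logits, labels)
    # KL(teacher || student) on softened distributions; the T^2 factor keeps
    # the soft-loss gradient magnitude comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return task + lam * soft
```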
Definition 1 (Check-in Record). A check-in record is a triple ⟨u, t, p⟩ representing that user u visits POI p at time t, where p denotes a uniquely identified venue of the form (id, category, coordinates), the last element representing the location of the POI.

Definition 2 (Trajectory). A trajectory is a sequence of check-in records (⟨u, t_1, p_1⟩, ⟨u, t_2, p_2⟩, . . . , ⟨u, t_m, p_m⟩) generated by user u in chronological order during a certain time interval τ, denoted by Tr_τ^u. A trajectory is called unlinked if we do not know the user who generated it. The time interval τ in our work is set to 24 hours.

We now state our problem as follows:

Problem (Trajectory-User Linking). The task of TUL is to identify anonymous trajectories with the users who generate them. Let T = {Tr_1, Tr_2, . . . , Tr_n} denote the set of unlinked trajectories, and U = {u_1, u_2, . . . , u_m} denote the set of users. Our goal is to find the mapping function f(·) satisfying the following condition:

f^* = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} \big( f(Tr_i) \ominus y_i \big),

where y_i is the true label of trajectory Tr_i, ⊖ is a difference evaluation operator (i.e., u_i ⊖ u_j = 0 if i = j, and 1 otherwise), and F is the hypothesis space of the TUL task.

The architecture of MainTUL is presented in Figure 1.

(Figure 1: The architecture of the proposed MainTUL framework.)

MainTUL contains three major components: check-in embedding, trajectory encoders, and the mutual distillation network.

The check-in embedding component contains two sub-modules: trajectory augmentation and the check-in embedding layer. Due to the inherent sparsity of check-in data, the number of user check-ins in a sub-trajectory within a time window is very limited, while a user's long-term historical trajectory often reveals more of the user's movement patterns. Therefore, we explore two different trajectory augmentation strategies to generate long-term trajectories for mutual distillation learning (see the sketch after this list):

• Neighbor Augmentation: Given an input trajectory Tr in time interval τ_i, we use the k + 1 sub-trajectories in the time intervals [τ_{i−k/2} : τ_{i+k/2}] to form a long-term augmented trajectory in chronological order.

• Random Augmentation: For the input trajectory, we randomly sample k sub-trajectories from all trajectories of the corresponding user, and then form a long-term trajectory together with the input trajectory according to the time information.
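The following Python sketch illustrates the two strategies under stated assumptions: each user's check-ins are pre-split into chronologically ordered daily sub-trajectories, and each record stores its timestamp first. The function names and record layout are ours, not the authors'.

```python
import random

def neighbor_augment(user_days, i, k):
    """Neighbor augmentation: concatenate the k+1 daily sub-trajectories
    centered on day i (indices clipped at the boundaries -- an assumption).
    user_days: a user's chronologically ordered daily sub-trajectories,
    each a list of (timestamp, poi, category) records (layout assumed)."""
    lo, hi = max(0, i - k // 2), min(len(user_days), i + k // 2 + 1)
    return [rec for day in user_days[lo:hi] for rec in day]

def random_augment(user_days, i, k):
    """Random augmentation: sample k other sub-trajectories of the same
    user and merge them with day i in chronological order."""
    pool = [day for j, day in enumerate(user_days) if j != i]
    sampled = random.sample(pool, min(k, len(pool)))
    merged = [rec for day in sampled + [user_days[i]] for rec in day]
    return sorted(merged, key=lambda rec: rec[0])  # sort by timestamp
```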
The goal of the multi-semantic check-in embedding is to generate dense representations for each part of a check-in (location, category, and time slice) from sparse one-hot representations. In this way, we not only avoid the curse of dimensionality but also capture contextual semantic information. Each trajectory actually has two sequence forms: a sequence of POIs and a sequence of POI categories. We divide a periodic time interval (e.g., one day) into multiple time slices (e.g., one hour per time slice) and map check-in timestamps to the corresponding time slices. Finally, we separately learn the embeddings of the two sequences, integrated with the time slices, as follows:

x_i^p = (W_p p_i + b_p) \,\|\, (W_t t_i + b_t), \qquad x_i^c = (W_c c_i + b_c) \,\|\, (W_t t_i + b_t),

where ∥ denotes concatenation (so each check-in token has dimension 2d, matching the position encoder below), p_i, c_i and t_i are one-hot encodings of the i-th POI, its category, and its time slice in the sequences, and W_p, W_c, W_t, b_p, b_c and b_t are learnable parameters.
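A minimal PyTorch sketch of such an embedding layer is given below; an nn.Embedding lookup plays the role of the W · one-hot products above (biases omitted for brevity), and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class CheckinEmbedding(nn.Module):
    """Multi-semantic check-in embedding (sketch). An nn.Embedding lookup
    is equivalent to multiplying a one-hot vector by a weight matrix."""
    def __init__(self, n_pois, n_cats, n_slices=24, d=256):
        super().__init__()
        self.poi = nn.Embedding(n_pois, d)
        self.cat = nn.Embedding(n_cats, d)
        self.slot = nn.Embedding(n_slices, d)  # shared time-slice table

    def forward(self, poi_ids, cat_ids, slot_ids):
        t = self.slot(slot_ids)
        x_p = torch.cat([self.poi(poi_ids), t], dim=-1)  # 2d-dim POI tokens
        x_c = torch.cat([self.cat(cat_ids), t], dim=-1)  # 2d-dim category tokens
        return x_p, x_c
```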
To learn higher-order transition patterns of check-in trajectory sequences, trajectory encoders need to be designed. We note that the sequence lengths of input trajectories and augmented long-term trajectories differ greatly. An overly complex trajectory encoder is not suitable for processing shorter trajectories, while an overly simple trajectory encoder cannot capture the long-term time dependencies of long trajectory sequences. Therefore, we design two different trajectory encoders based on an RNN model and a temporal-aware self-attention network, respectively.

RNN is an efficient architecture for processing simple variable-length sequences. Due to the sparse nature of check-in trajectories, we use a popular RNN variant, the Long Short-Term Memory network (LSTM) [Hochreiter and Schmidhuber, 1997; Huang et al., 2019], as the encoder f_θ(·) for the input trajectory. For both the check-in POI and category sequences, we use a shared encoder f_θ(·) and take the hidden state of the last time step as the representation of each sequence. Then we use a learnable parameter α to balance the weights of the two representations and map them to the user dimension via a multilayer perceptron (MLP) to obtain the final representation of the input trajectory:

z_{in}^{\theta} = \mathrm{MLP}\big( \alpha \cdot f_\theta(X_{in}^p) + (1-\alpha) \cdot f_\theta(X_{in}^c) \big),

where X_{in}^p = {x_i^p | i = 1, 2, . . . , m} and X_{in}^c = {x_i^c | i = 1, 2, . . . , m} denote the embeddings of the check-in POI sequence and category sequence of the input trajectory, respectively.

The Transformer architecture [Vaswani et al., 2017] is very expressive and flexible for both long- and short-term dependencies, and has proven superior to traditional sequence models in dealing with long sequences [Yang et al., 2019]. However, for a long trajectory sequence one problem remains: time slice information cannot fully reflect how the trajectory changes along the time dimension (e.g., relative visit time differences). Therefore, inspired by [Zuo et al., 2020], we design a temporal-aware position encoder to replace the position encoder in the original Transformer:

[\mathrm{PE}(t_j)]_i = \begin{cases} \sin(w_i t_j), & \text{if } i \text{ is even}, \\ \cos(w_i t_j), & \text{if } i \text{ is odd}, \end{cases}

where w_i is a learnable parameter, i is the index of the embedding dimension (i ≤ 2d), and t_j is the visit timestamp of the j-th check-in record. Therefore, for any two POIs or categories in the sequence, the relative visit time information can be captured by the dot product

\mathrm{PE}(t_j) \cdot \mathrm{PE}(t_k) = \sum_i \cos\big( w_i (t_j - t_k) \big),

which depends only on the time difference t_j − t_k.
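The sketch below illustrates this style of encoding in PyTorch, following the Transformer Hawkes Process idea of sinusoids with learnable frequencies evaluated at raw timestamps. It groups the sine and cosine halves rather than interleaving even/odd dimensions, which is equivalent up to a permutation; the class name and shapes are assumptions.

```python
import torch
import torch.nn as nn

class TemporalPositionEncoding(nn.Module):
    """Temporal-aware position encoding (sketch): sinusoids with learnable
    frequencies w_i evaluated at raw visit timestamps."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.rand(dim // 2))  # learnable frequencies

    def forward(self, t):
        # t: (batch, seq_len) tensor of visit timestamps
        phase = t.unsqueeze(-1) * self.w             # (batch, seq_len, dim/2)
        return torch.cat([torch.sin(phase), torch.cos(phase)], dim=-1)
```

For each frequency, sin(w t_j)·sin(w t_k) + cos(w t_j)·cos(w t_k) = cos(w (t_j − t_k)), which is exactly how the attention dot products read off relative visit times.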
We use the improved temporal-aware Transformer encoder as the trajectory encoder f_φ(·) for the augmented long-term trajectory. Similarly, a shared encoder f_φ(·) processes the check-in POI and category sequences, and we pool the token embeddings of the last layer to obtain the latent representations of the two sequences. Next, we use a learnable parameter β to balance the weights of the two representations and obtain the final representation of the augmented long-term trajectory through an MLP predictor:

z_{au}^{\phi} = \mathrm{MLP}\big( \beta \cdot f_\phi(X_{au}^p) + (1-\beta) \cdot f_\phi(X_{au}^c) \big),

where X_{au}^p and X_{au}^c denote the embeddings of the check-in POI sequence and category sequence of the augmented long-term trajectory, respectively. The detailed formulations of encoders f_θ and f_φ can be found in the appendix (https://github.com/Onedean/MainTUL/tree/main/appendix).

Different from traditional knowledge distillation [Hinton et al., 2015], in this work we do not strictly distinguish between teacher and student networks. Both encoders learn from scratch rather than compressing a new network from another deeper frozen model. Let Tr_in and Tr_au denote the input trajectory and the augmented trajectory, respectively. Trajectory encoder f_θ embeds Tr_in into z_in^θ; similarly, Tr_au is encoded into z_au^φ via encoder f_φ. We expect the encoder network f_θ to be trained toward the true label y, while the knowledge of the long-term trajectory, which has more representation ability, is transferred to f_θ. During training, the loss function L_1 is:

\mathcal{L}_1 = H\big(y, z_{in}^{\theta}\big) + \lambda T^2 \, KL\Big( \varphi\big(z_{au}^{\phi}/T\big) \,\big\|\, \psi\big(z_{in}^{\theta}/T\big) \Big),

where H(·,·) refers to the conventional cross-entropy loss and KL(·,·) to the Kullback-Leibler divergence between the softmax ϕ and log-softmax ψ outputs. T is the temperature, intended to smooth the outputs, and λ is a balancing weight.

To maximize the use of training data, we propose a mutual distillation strategy: the input trajectory and the augmented trajectory are exchanged for retraining, so that the two trajectory encoders see more data and enhance their discriminative ability. In the mutual distillation strategy, encoders f_θ and f_φ are trained collaboratively and treated as peers rather than student/teacher. Specifically, for input trajectory Tr_in and augmented trajectory Tr_au, we swap and send them to the other encoder to obtain new representations z_in^φ and z_au^θ, and calculate the loss function L_2 as follows:

\mathcal{L}_2 = H\big(y, z_{in}^{\phi}\big) + \lambda T^2 \, KL\Big( \varphi\big(z_{au}^{\theta}/T\big) \,\big\|\, \psi\big(z_{in}^{\phi}/T\big) \Big).

This operation can also be regarded as a data augmentation strategy. Although the RNN encoder is not specially designed for long-term trajectory data, it still benefits from seeing more trajectory sequences related to the input trajectory during training. Our final loss function for optimizing the mutual distillation network is:

\mathcal{L}_{total} = \mathcal{L}_1 + \mathcal{L}_2. \quad (9)

Notice that in the TUL prediction stage there is no data augmentation operation. That is, in testing, each input trajectory is encoded by trajectory encoder f_θ and then linked with the predicted user label.
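Putting the two losses together, a hedged sketch of one training step might look as follows. The encoder call signatures are assumptions (both are assumed to map a trajectory batch to user logits), λ = 10 matches the experimental settings below, the temperature default is a placeholder, and detaching the teacher side of each KL term is one common choice rather than a detail confirmed by the paper.

```python
import torch.nn.functional as F

def mutual_distillation_step(f_theta, f_phi, tr_in, tr_au, y, T=10.0, lam=10.0):
    """One optimization step over L_1 + L_2 (sketch, not the authors' code)."""
    def kd(student, teacher):
        # Soft-label transfer; .detach() stops gradients through the teacher
        # side of this term -- a design choice, not necessarily the paper's.
        return F.kl_div(F.log_softmax(student / T, dim=-1),
                        F.softmax(teacher.detach() / T, dim=-1),
                        reduction="batchmean") * (T * T)

    z_in_theta, z_au_phi = f_theta(tr_in), f_phi(tr_au)   # L_1: phi teaches theta
    l1 = F.cross_entropy(z_in_theta, y) + lam * kd(z_in_theta, z_au_phi)
    z_in_phi, z_au_theta = f_phi(tr_in), f_theta(tr_au)   # L_2: inputs swapped
    l2 = F.cross_entropy(z_in_phi, y) + lam * kd(z_in_phi, z_au_theta)
    return l1 + l2
```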
We use two real-world check-in mobility datasets [Liu et al., 2014; Yang et al., 2015] collected from two popular location-based social network platforms, Foursquare and Weeplaces. For Foursquare and Weeplaces, we choose the top 800 and top 400 users with the most check-ins, respectively, for evaluating model performance. In experiments, we use the first 80% of each user's sub-trajectories for training and the remaining 20% for testing, and select 20% of the training data as a validation set, used together with the early stopping mechanism to find the best parameters and avoid overfitting. The statistics of the two datasets are summarized in Table 1.

We consider the following baselines for comparison:

• Classical methods: (1) LCSS, a common and effective trajectory similarity measure [Ying et al., 2010]. (2) Signature Representation (SR), a state-of-the-art trajectory similarity measure for moving object linking [Jin et al., 2019, 2020].

• Machine learning methods: (3) Linear Discriminant Analysis (LDA), a classic spatial data classification method [Shahdoosti and Mirzapour, 2017]. (4) Decision Tree (DT), an effective classification method for trajectory data [Jiang, 2018].

• Deep neural network models: (5) TULER, an RNN model for the TUL task [Gao et al., 2017], with three variants: RNN with Gated Recurrent Units (TULER-G), LSTM (TULER-L), and bidirectional LSTM (Bi-TULER). (6) TULVAE, which utilizes a VAE to learn the hierarchical semantics of trajectories with RNNs [Zhou et al., 2018]. (7) DeepTUL, a recurrent network with an attention mechanism for the TUL task [Miao et al., 2020].

We use Acc@k, Macro-Precision, Macro-Recall, and Macro-F1 to evaluate model performance. Specifically, Acc@k is used to evaluate the accuracy of TUL prediction.
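For concreteness, Acc@k and Macro-F1 can be computed as in the short sketch below (standard definitions, using NumPy and scikit-learn; not code from the MainTUL repository).

```python
import numpy as np
from sklearn.metrics import f1_score

def acc_at_k(logits, labels, k=1):
    """Acc@k: fraction of trajectories whose true user is among the
    k highest-scoring users."""
    topk = np.argsort(-logits, axis=1)[:, :k]       # (n, k) user indices
    return (topk == labels[:, None]).any(axis=1).mean()

def macro_f1(logits, labels):
    """Macro-F1 over users, via scikit-learn."""
    return f1_score(labels, logits.argmax(axis=1), average="macro")
```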
For baselines, we use the parameter settings recommended in their papers and fine-tune them to be optimal. For MainTUL, we set the check-in embedding dimension d to 512 and λ to 10, and use an early stopping mechanism with a patience of 3 to avoid overfitting. The learning rate is initially set to 0.001 and decays by 10% every 5 epochs. We repeat each experiment 10 times and report the average for all methods. More experimental settings can be found in the appendix.

We report the overall performance against the deep neural network models in Table 2, where the best result is shown in bold and the second best is underlined. The comparison with classical learning methods is shown in Figure 2. As shown in Table 2 and Figure 2, MainTUL significantly outperforms all baselines in terms of all evaluation metrics on the two real-world check-in datasets. Specifically, MainTUL achieves 14.95% Acc@1 and 14.11% Macro-F1 improvements on average over the best-performing baseline. The main reason is that our mutual distillation model, built on two different trajectory encoders, captures the spatio-temporal movement patterns of users' check-in trajectories more effectively than RNN-based models (e.g., TULER and DeepTUL). In addition, contextual features such as POI category and temporal information are integrated into MainTUL to further improve performance.

We also observe that model performance on the dataset with more users is worse than on the dataset with fewer users. This is intuitive: the more users there are, the more difficult the classification becomes. However, the Macro-F1 of MainTUL drops by only 3.68% from |U| = 400 to |U| = 800, while that of the state-of-the-art model (i.e., DeepTUL) drops by 8.13%-9.22%. For DeepTUL, considering the historical data of all users does bring some improvement on the dataset with fewer users, but when the number of users is large, the large amount of historical data also introduces more noise, resulting in a sharp performance drop. In contrast, our model uses only the historical data of the same user for trajectory augmentation and knowledge distillation during training, and requires no historical data for testing; it therefore still performs well on the dataset with more users.

Notice that SML-TUL [Zhou et al., 2021a] and TGAN [Zhou et al., 2021b] are not compared in our experiments because their source code is not publicly available. However, MainTUL significantly outperforms SML-TUL and TGAN in terms of Acc@k and Macro-F1 on Foursquare according to the results reported in [Zhou et al., 2021a], even though MainTUL links more users (MainTUL vs. SML-TUL vs. TGAN: 60.58% vs. 57.23% vs. 53.00% in Acc@1, and 59.12% vs. 52.66% vs. 47.76% in Macro-F1).

In our ablation study, we compare our model with three carefully designed variants: (1) TUL-CA removes the category and time information from the check-in embedding layer. (2) TUL-TA uses the position encoding of [Vaswani et al., 2017] in place of our temporal-aware position encoding. (3) TUL-MUT removes the loss function L_2, to verify the importance of the two encoders extracting knowledge from each other. The results of the ablation study are shown in Figure 3. As we can see, all key components contribute to the performance of our model. TUL-MUT performs the worst in most cases, indicating that the mutual distillation strategy has the greatest impact on performance. The comparisons of TUL-CA and TUL-TA against MainTUL reflect the importance of the contextual features and the temporal-aware position encoding, respectively. The results in Figure 3 show that the temporal-aware position encoding has a greater impact on the Foursquare dataset, while the contextual features have a greater effect on the Weeplaces dataset.

We also evaluate the effectiveness of each term in our loss function (Eq. (9)). Table 3 reports the results of different loss functions on the two datasets with 400 users. First, removing the term λT²KL(·) degrades performance, which demonstrates the importance of the dark knowledge of the long-term augmented trajectory. Second, removing the terms H(y, z_in^θ) and H(y, z_in^φ), i.e., ignoring the input trajectory labels, does not degrade performance significantly. This indicates that our model can still achieve good results through the knowledge distillation of augmented trajectories alone.

We also evaluate the sensitivity of MainTUL to different settings of the temperature T and the hyperparameter λ in the loss function. The results on Foursquare are shown in Figure 4. As T increases, performance first rises and then falls. This is intuitive: a lower temperature makes the distribution sharper, while a higher temperature smooths the distribution so much that no effective information can be extracted. The same observation holds for λ. In addition, performance increases rapidly as λ grows from 0.1, suggesting that the proposed knowledge distillation module contributes substantially to overall performance.

We also evaluate combinations of different types of trajectory encoders. The results on Foursquare are shown in Table 4. When the two trajectory encoders are of the same type, model performance is poor, which validates the rationality of designing two different types of trajectory encoders. In the case of using two different encoders (i.e., f_θ is the RNN encoder and f_φ is the Transformer encoder), using encoder f_θ for testing achieves the best performance, which also shows that the simpler encoder is better suited to capturing the movement patterns of sparse sub-trajectories.

Finally, we evaluate the two proposed trajectory augmentation strategies. Based on the results on Foursquare in Table 5, we make two observations: (1) both augmentation strategies help improve model performance; (2) random augmentation outperforms neighbor augmentation. The reason is that the random strategy can surface a user's latent movement patterns by randomly combining the user's past travels, yielding better results.

In this paper, we propose a novel mutual distillation learning network, MainTUL, to solve the TUL problem for sparse check-in mobility data. MainTUL effectively learns user movement patterns of the input trajectory via the designed mutual distillation network, which consists of two different trajectory encoders over multi-semantic check-in embeddings. Experiments on two real-world check-in mobility datasets demonstrate that MainTUL significantly outperforms state-of-the-art baselines in terms of all evaluation metrics for TUL prediction.
References
Temporal multi-view graph convolutional networks for citywide traffic volume inference
Identifying human mobility via trajectory embeddings
Adversarial mobility learning for human trajectory classification
Understanding the urban pandemic spreading of COVID-19 with real world mobility data
Distilling the knowledge in a neural network
Long short-term memory
DeepCrime: Attentive hierarchical recurrent networks for crime prediction
MiST: A multiview and multimodal spatial-temporal learning framework for citywide abnormal event forecasting
A survey on spatial prediction methods
Moving object linking based on historical trace
Trajectory-based spatiotemporal entity linking
Scaling up dynamic time warping for datamining applications
Deep representation learning for trajectory similarity computation
Analyzing location predictability on location-based social networks
Geography-aware sequential location recommendation
Pre-training context and time aware location embeddings from spatial-temporal trajectories for user next location prediction
Exploiting geographical neighborhood characteristics for location recommendation
Geo-ALM: POI recommendation by fusing geographical information and adversarial learning mechanism
Trajectory-user linking with attentive recurrent network
The state of knowledge distillation for classification
Spectral-spatial feature extraction using orthogonal linear discriminant analysis for classification of hyperspectral data
Trajectory similarity join in spatial networks
Attention is all you need
Hierarchically structured transformer networks for fine-grained spatial event forecasting
Modeling user activity preference by leveraging user spatial temporal characteristics in LBSNs
T3S: Effective representation learning for trajectory similarity computation
Computing trajectory similarity in linear time: A generic seed-guided neural metric learning approach
A linear time approach to computing time series similarity based on deep metric learning
Mining user similarity from semantic trajectories
Trajectory similarity learning with auxiliary supervision and optimal matching
Mutual-learning improves end-to-end speech translation
Self-supervised human mobility learning for next location prediction and trajectory classification
Improving human mobility identification with trajectory augmentation
Transformer Hawkes process

This work is partially supported by the National Natural Science