TransMUSE: Transferable Traffic Prediction in Multi-Service Edge Networks

Luyang Xu, Haoyu Liu, Junping Song, Rui Li, Yahui Hu, Xu Zhou, Paul Patras

March 4, 2022

Abstract: The Covid-19 pandemic has forced the workforce to switch to working from home, which has put significant burdens on the management of broadband networks and called for intelligent, service-by-service resource optimization at the network edge. In this context, network traffic prediction is crucial for operators to provide reliable connectivity across large geographic regions. Although recent advances in neural network design have demonstrated the potential to effectively tackle forecasting, in this work we reveal, based on real-world measurements, that network traffic across different regions differs widely. As a result, models trained on historical traffic data observed in one region can hardly serve in making accurate predictions in other areas. Training bespoke models for different regions is tempting, but that approach bears significant measurement overhead, is computationally expensive, and does not scale. Therefore, in this paper we propose TransMUSE, a novel deep learning framework that clusters similar services, groups edge nodes into cohorts by traffic feature similarity, and employs a Transformer-based Multi-service Traffic Prediction Network (TMTPN), which can be directly transferred within a cohort without any customization. We demonstrate that TransMUSE exhibits imperceptible performance degradation in terms of mean absolute error (MAE) when forecasting traffic, compared with settings where a model is trained for each individual edge node. Moreover, our proposed TMTPN architecture outperforms the state of the art, achieving up to 43.21% lower MAE in the multi-service traffic prediction task. To the best of our knowledge, this is the first work that jointly employs model transfer and multi-service traffic prediction to reduce measurement overhead, while providing fine-grained, accurate demand forecasts for edge service provisioning.

1. Introduction

Edge computing pushes computation and data storage closer to the user, thereby improving response times and saving communication bandwidth, while serving multiple applications simultaneously, e.g., video streaming, gaming, content delivery, etc. As people increasingly often work remotely following the Covid-19 outbreak and require network support for different services, the edge computing paradigm is witnessing growing uptake. In order to optimise user experience and operational costs, infrastructure providers have been pursuing dynamic provisioning of network resources based on predictions of user demand [1].

Previous efforts in tackling network traffic prediction frequently exploit the ability of deep neural networks (DNNs) to learn complex patterns from historical data [2, 3, 4, 5, 6, 7]. However, existing solutions either require training one dedicated model for each geographic region and hence have limited transferability (which is of paramount importance in reducing computational costs and the environmental footprint of training DNNs) [2, 3, 4, 5], or disregard essential correlations among services [6, 7].
In practical large-scale network deployments, (i) per-service patterns are often distinct within a region, as exemplified in Figure 1, while (ii) certain areas may exhibit similar characteristics that would allow for direct transfer of models among them, without the need for retraining. These key observations are confirmed by our analysis of a real-world network traffic dataset collected in a major city in Sichuan province, China, serving 2.6 million users, spanning 6.3 square kilometers, and comprising eight edge nodes.

This motivates us to propose TransMUSE, a transferable traffic prediction framework for multi-service edge networks, which first groups edge nodes according to per-service statistical features. Within each cohort, reference neural models are chosen and trained on data collected only in the region with the highest overall traffic consumption, and can then be transferred to the other group members. As reference model, we put forward a Transformer-based [8] Multi-service Traffic Prediction Network (TMTPN). Furthermore, we propose WK-means, a service clustering algorithm based on the Wasserstein distance, to categorize services according to their similarity. We train a separate TMTPN model for each service cluster to boost prediction performance at a regional level. Finally, the reference models are transferred to other regions directly, without adaptation.

Our proposed model transfer framework, TransMUSE, provides a comprehensive and cost-effective solution for traffic prediction in multi-service edge networks. The key advantages of TransMUSE are as follows:

(i) it provides a model transfer approach among edge nodes that reduces measurement and computational overhead without compromising prediction accuracy: compared with training a model individually on local data for each edge node, TransMUSE exhibits imperceptible performance degradation, with only 1.7% and 0.26% higher MAE and RMSE, respectively;

(ii) the proposed TMTPN takes service correlations into consideration, to further reduce the overhead and the energy that would otherwise be required to maintain a separate prediction model for each service; our experiments demonstrate that TMTPN outperforms the state-of-the-art MTNet benchmark [2] on the multi-service traffic prediction task by 18.74% and 18.49% in terms of MAE and RMSE, respectively;

(iii) the WK-means service clustering tackles both model under-fitting and speed of convergence, improving TMTPN's prediction performance: it attains 17.59% and 27.89% lower MAE and RMSE, respectively, compared to predicting without prior service clustering.

To the best of our knowledge, TransMUSE is the first multi-service traffic forecasting solution for edge networks that leverages model transfer and service clustering to achieve high accuracy at low measurement cost.

The rest of the paper is organised as follows. The multi-service prediction problem is formalised in Section 2. The proposed TransMUSE framework is discussed in detail in Section 3. Section 4 provides exhaustive experimental results that demonstrate TransMUSE's efficacy. Section 5 discusses the most relevant related work and Section 6 concludes the paper.

2. Problem Formulation

Our aim is to address the challenges of handling the spatial heterogeneity of service traffic in edge networks and of reducing model training costs when forecasting future demands, so as to support the effective management of network resources.
Formally, multi-service traffic forecasting seeks to maximize the probability that, given previous measurements of the traffic volume consumed by services, the predicted traffic consumption over future time steps is as close as possible to the ground truth. Denoting by $s_t^i$ the traffic volume of service $i$ at timestamp $t$ and by $\mathbf{S}_t := [s_t^1, \ldots, s_t^N]$ the snapshot of all $N$ services at time $t$, and considering a forecasting model that is parameterized by $\theta$, the multi-service traffic forecasting problem is equivalent to:

$$\hat{\mathbf{S}}_{t+1}, \ldots, \hat{\mathbf{S}}_{t+H} = \arg\max_{\mathbf{S}_{t+1}, \ldots, \mathbf{S}_{t+H}} p\left(\mathbf{S}_{t+1}, \ldots, \mathbf{S}_{t+H} \mid \mathbf{S}_{t-T+1}, \ldots, \mathbf{S}_{t}; \theta\right),$$

where $T$ is the length of the historical observation window and $H$ the forecasting horizon. To solve this problem, we design a Transformer-based Multi-service Traffic Prediction Network (TMTPN) that captures temporal correlations among traffic time series via multi-head attention, then improve forecasting accuracy via service clustering, as we detail next.

3. The TransMUSE Framework

We propose TransMUSE, a deep learning framework for accurate and cost-effective multi-service forecasting at the network edge. Figure 2 gives an overview of the components this framework entails and the relationships between them, namely:

1. Edge Node Clustering: We cluster edge nodes by a set of service-level statistical features using the K-means algorithm, to determine the neural model transfer scope.
2. Reference Node Selection: Within each scope, we select the node with the overall highest traffic consumption as the reference node; reference neural models for forecasting will be trained with data collected at such nodes.
3. Service Grouping: As certain mobile services exhibit statistical similarities, we cluster services using a modified K-means algorithm based on the Wasserstein distance, aiming to reduce the number of multi-service neural models to be employed for prediction.
4. Model Training: At the level of each reference node, we train a dedicated TMTPN model for each service cluster, which simultaneously predicts the volume of traffic for all services within that cluster.
5. Model Transfer: We transfer the trained reference models from each reference edge node to all other nodes within the corresponding cluster, where they are applied for inference without further training.

Next, we discuss in detail the key stages of our TransMUSE framework.

Figure 2: The proposed TransMUSE framework incorporates five stages: 1) Edge nodes are grouped into several node clusters, within which model transfer is to be conducted; 2) In each cluster, the edge node with the largest traffic volume is selected as reference (highlighted with hashed patterns); 3) At each reference node, services are further partitioned into service clusters by WK-means; 4) One TMTPN is trained for each service cluster; 5) The models trained on reference nodes are transferred to the recipients (highlighted on the right) within the corresponding node clusters.

In Multi-access Edge Computing (MEC) scenarios, it is often impractical to train a neural model at each individual edge node, as their computational power is limited and the operational costs and energy expenditure can become prohibitive to operators as deployment density increases. Edge model transfer aims to reduce the cost of measurement collection and model training by confining these tasks to designated nodes and reusing the models trained there on other nodes, without further local tuning. Different from cloud-edge approaches, where a central node maintains a global model refined through model updates resulting from local training (federated learning), edge model transfer only requires models to be exchanged among edge nodes, without the need for a central cloud.
This brings additional merits in terms of data privacy and communication overhead reduction, as the transfer process is confined within a limited scope. A model to be transferred is called a reference model, the node where a reference model is trained is called a reference node, and the edge nodes that adopt it are referred to as recipients.

There are two key issues to address in the edge model transfer process. The first is determining the scope of model transfer. Different edge nodes may observe distinct traffic patterns due to geographic dissimilarities in terms of mobile user demographics [9] or socioeconomic function (residential areas, business districts, shopping centers, etc.). Edge model transfer, therefore, should only be applied across edge nodes (within a cluster) with similar traffic features. Secondly, the choice of the edge node at which to train a reference model to be transferred within the corresponding cluster will impact inference accuracy. We put forward an edge model transfer strategy that deals with these two issues as follows:

• Determining Transfer Scope: We use K-means clustering to group edge nodes according to four statistical features, i.e., the mean, standard deviation, maximum and minimum of the traffic volume of each service over one month. With 20 services, this leads to vectors of shape 80 × 1 that represent an edge node. A model will only be transferred within the same cluster of edge nodes.

• Reference Model Training: Within each cluster, a reference model will be trained only with data from the edge node where the overall highest traffic consumption is observed. Reference models are then transferred to the recipients within the corresponding clusters. The results we present in Section 4 confirm the generalization abilities of this approach.

At the level of a reference node, a set of neural models will be trained, each of which targets future traffic predictions for a group of services with similar characteristics, as we explain next.

Traffic patterns and volumes may differ among services due to content popularity, number of service subscribers, service scope, etc. (see Figure 1), leading to high information entropy when observing all services together. Therefore, training a single model to predict the demand of all services may lead to under-fitting, because the model would need to learn highly convoluted patterns. To tackle this issue, we propose a service clustering algorithm based on the Wasserstein distance (WD) between per-service time series, which we name WK-means. This facilitates effective simultaneous prediction of future traffic volumes for services with similar sequential features.

There are two key factors to consider when measuring time-series similarity, namely magnitude and 'shape'. The former indicates how comparable the traffic volumes of different services are; the latter indicates any similarities in terms of periodicity and short-term temporal patterns. The WD takes these two factors into consideration at the same time, which makes it particularly suitable for our grouping task. Originally, the WD was proposed to measure the similarity between two probability distributions, and it has recently been employed in optimal transportation problems [10]:

$$W(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^n \times \mathbb{R}^n} d(x, y)\, \mathrm{d}\gamma(x, y),$$

where $\mu$ and $\nu$ are two probability distributions in $\mathbb{R}^n$, and $\Gamma(\mu, \nu)$ is the set of all probability measures on $\mathbb{R}^n \times \mathbb{R}^n$ with marginals $\mu$ and $\nu$. $d(x, y)$ is a measure of distance between $x \in \mu$ and $y \in \nu$ (e.g., $d = \ell_2$ for the Earth mover's distance).
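Since each service time series is one-dimensional, the WD between two services can be computed directly from their empirical value distributions. Below is a minimal sketch using SciPy's 1-D Wasserstein distance; the synthetic gamma-distributed traffic traces are purely illustrative, not actual measurements from the dataset:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two illustrative per-service traffic time series (MB per minute).
rng = np.random.default_rng(0)
video = rng.gamma(shape=9.0, scale=12.0, size=1440)  # high-volume service
chat = rng.gamma(shape=2.0, scale=1.5, size=1440)    # low-volume service

# 1-D Wasserstein distance between the empirical value distributions;
# it reflects both magnitude and distribution-shape differences.
print(wasserstein_distance(video, chat))
```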
Intuitively, the WD represents the minimal distance for moving the mass of distribution $\mu$ to exactly fit the mass of distribution $\nu$. Unlike other distance metrics, such as the Euclidean distance, Jensen-Shannon (JS) divergence or Kullback-Leibler (KL) divergence, the WD has the following key advantages: (i) if the target distributions lie in low-dimensional manifolds or have disjoint support, which is not uncommon for high-dimensional data, the WD offers a more informative measure (which is not the case for the KL and JS divergences, which return a constant value or infinity) [11, 12]; and (ii) the WD maintains the underlying geometry of the space [13], that is, it not only takes quantitative values into consideration, but also pays attention to the similarity of the distributions' shapes. In contrast, the Euclidean distance cannot quantify shape differences or capture the degree of change between two time series [14].

Based on the WD, we propose the WK-means service clustering algorithm, summarized by the pseudo-code in Algorithm 1. To generate $k$ clusters from $n$ services, WK-means initially sorts all the services by their volume and splits the sorted sequence at positions $n/k, 2n/k, \ldots, (k-1)n/k$. That is, the sorted sequence is evenly divided into $k$ segments (lines 2-6). WK-means then chooses the service in the middle of each segment as the cluster center (lines 8-13); compared with random initialization, this speeds up convergence. Then, the WD between each service and each sub-cluster center is calculated, and each service is reassigned to its nearest sub-cluster (lines 14-24). Finally, each sub-cluster center is updated (line 25) and the previous two steps are iterated until all sub-clusters converge or the iteration count reaches a predefined limit (lines 7-26).

In a setting with multiple edge nodes, it is possible that WK-means generates different service clustering results on different nodes. To comply with our model transfer strategy, we first run WK-means at the level of every edge node and select the most frequent clustering pattern as the global service grouping. This pattern is then applied to all other nodes in our design.

Algorithm 1 WK-means. Require: list of service time series; number of clusters $k$; maximum number of iterations. Ensure: list of cluster assignments.

To perform multi-service traffic forecasting, we design Transformer-based Multi-service Traffic Prediction Networks (TMTPNs), each of which inherits from the canonical Transformer architecture and is dedicated to an individual service cluster. Transformers have shown remarkable performance in processing sequential data and have previously been adopted for natural language processing [15], computer vision [16], and vehicular traffic prediction [17] tasks. Our TMTPN model is illustrated in Figure 3 and follows an encoder-decoder paradigm, encompassing the following components:

• Multi-Head Attention: Multi-head attention consists of multiple scaled dot-product attention structures that capture temporal dependencies in long sequences. The attention block receives three inputs: $Q \in \mathbb{R}^{L \times d}$ (query), $K \in \mathbb{R}^{L \times d}$ (key) and $V \in \mathbb{R}^{L \times d}$ (value), in which $L$ represents the sequence length and $d$ is the embedding dimension of each item in the sequence. Attention is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V.$$

The product $QK^\top$ generates an $L \times L$ matrix of alignment scores, where each entry denotes the correlation between two instances in the sequence. The matrix is scaled and then multiplied by $V$ to generate the hidden representation of the input that incorporates attention information.
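For illustration, the following is a minimal NumPy sketch of the scaled dot-product attention just described. The shapes and the optional additive mask (which anticipates the look-ahead mask discussed next) are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q, K, V: arrays of shape (L, d); returns the (L, d) attended representation."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (L, L) matrix of alignment scores
    if mask is not None:
        scores = scores + mask           # additive mask, e.g., a look-ahead mask
    # Numerically stable row-wise softmax over the alignment scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Look-ahead mask: entries above the diagonal are a large negative number,
# so position t cannot attend to any future position t' > t.
L, d = 5, 8
mask = np.triu(np.full((L, L), -1e9), k=1)
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))
out = scaled_dot_product_attention(X, X, X, mask)  # masked self-attention
```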
Multi-head attention splits $Q$, $K$ and $V$ into multiple chunks, which are processed by independent attention blocks. The outputs of all the attention blocks are concatenated and projected back into the $\mathbb{R}^{L \times d}$ hyperspace. Specifically, the first attention block in the decoder computes the self-attentional representation of the decoder input, and the second block takes the encoder output as the key $K \in \mathbb{R}^{L \times d}$ and the value $V \in \mathbb{R}^{L \times d}$, querying which historical inputs are important when making future predictions.

• Positional Encoding (PE): Since Transformers do not contain any sequential structure, timing features are not encoded in the network by default. Therefore, a positional encoding that reflects the relative position of each timestamp is added to the input sequence. PE is computed as $PE_{(pos, 2i)} = \sin\left(pos / 10{,}000^{2i/d}\right)$, where $pos$ denotes the position index of the item in the sequence and $i$ is the dimension of the encoded position.

• Parallel Decoding: Traditional seq2seq models [7] perform decoding in an auto-regressive manner during training. That is, decoding the $t$-th element in a sequence relies on the hidden state passed from timestamp $t-1$ and the decoded $(t-1)$-th item, which are provided as input. It is therefore impossible to decode all the items in parallel. Transformers overcome this problem during training by introducing the shifted decoder input and a look-ahead mask. Assume the ground truth to be provided to the decoder is $Y = [y_1, \ldots, y_L]$; then the input is the shifted-right ground truth $Y' = [0, y_1, \ldots, y_{L-1}]$. The look-ahead mask $M$ is introduced when computing the alignment scores as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}} + M\right)V,$$

where $Q = K = V = Y'$, and $M$ is an $L \times L$ matrix with each entry above the diagonal equal to negative infinity and each entry on or below the diagonal equal to 0. The scaled matrix of alignment scores masked with $M$ yields an $L \times L$ lower-triangular weight matrix, meaning that at a given timestamp $t$ there is no correlation with the input from any future timestamp $t' > t$. With masking, the decoder can approximate the outputs at all timestamps, $\hat{Y} = [\hat{y}_1, \ldots, \hat{y}_L]$, in parallel. This technique is only applied during training; during testing, the Transformer decodes step by step, like seq2seq models do.

Overall, the proposed TMTPN architecture has several merits: (i) it can be trained fast, as the look-ahead mask and the shifted decoder input facilitate parallelization; (ii) it can process longer sequences than traditional seq2seq models.

4. Experiments

We implement TransMUSE and its TMTPN models, as well as a set of benchmark neural models, in TensorFlow v2.3.0 using the cuDNN v7.6 and CUDA v10.1 libraries. To demonstrate the performance gains of our solution, we train and evaluate the neural models on a large-scale real-world wired network traffic dataset collected by a network operator in Sichuan Province, China. For this, we employ a high-performance computing cluster comprising 12 servers, each equipped with a 32-core Intel E5-2620 CPU and running Red Hat Enterprise Linux, and accelerate the training process with multiple GPUs out of a pool of 96 Nvidia RTX 2080Ti units.

We conduct three sets of experiments to demonstrate: (1) the multi-service traffic prediction performance gains attained by our TMTPN models; (2) the benefit of employing service clustering with WK-means; and (3) the forecasting performance with edge model transfer.

The dataset we employ was collected in a city with over 6 km² of land coverage, administratively divided into 7 districts and 1 core urban area, and with a population of approximately 2.7 million inhabitants.
The traffic within each district (D1 to D8) is handled by a dedicated edge node, and the high-level structure of the deployment is illustrated in Figure 4. Traffic data was collected by Deep Packet Inspection (DPI) via port mirroring, between July 1st and 31st, 2020. Traffic was aggregated at session level, with only the application type, district identifier, direction (uplink/downlink), total volume and timing information (session start/end) being recorded, to preserve anonymity. In total, 20 service types are distinguished, as summarised in Table 1, where they are sorted in descending order by their volume across the entire deployment and indexed. The traffic volume distribution of the top-8 services is shown in Figure 5, where the bars depict the fraction of the overall volume and the line the corresponding values. Due to the commercially sensitive nature of the dataset, we cannot disclose the precise identity of the city, nor the specific service names for which traffic measurements were collected.

Over the 31 consecutive days of measurements, we sample the traffic consumption every minute, assuming uniform consumption throughout each session's duration. This is reasonable given the predominantly short-lived nature of sessions, and leads to temporal sequences of 44,640 data points for each service in each region. We normalize service traffic volumes to the 0-1 range, to ensure similar magnitudes during training. We use an 80/10/10 data split for training, validation, and testing, and train models separately on a region-by-region basis.

For comparison, we consider the following state-of-the-art DL models as baselines:

• LSTM, which is now a classic structure for tackling regression tasks and has been extensively used for traffic prediction [18, 19]. We implement a three-layer LSTM, which offers an appropriate complexity-effectiveness tradeoff.

• MTNet, which was designed for multivariate time series prediction and adopts an encoder-decoder architecture to extract both long- and short-term hidden representations and the correlations among these [2].

• GraphConv, which is also aimed at tackling multiple time series predictions [20], integrating graph convolution [21] and an LSTM network to extract correlations between multiple sequences as well as temporal patterns. We employ the Spektral library for our implementation [22].

• AttentionAR, a model that we implement based on the Bahdanau attention structure [23], with rolling prediction through an LSTM cell. Attention is used to assign weights to historical inputs.

Table 2: MAE and RMSE performance (in MB) on the multi-service traffic forecasting task with LSTM, AttentionAR, GraphConv, MTNet and our TMTPN across the 8 districts.

To evaluate the performance of our proposed models and that of the benchmarks considered, we compute the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). In essence, these metrics quantify the difference between ground-truth and predicted values, and are defined as follows [24]:

$$\mathrm{MAE} = \frac{1}{NH} \sum_{i=1}^{N} \sum_{t=1}^{H} \left| s_t^i - \hat{s}_t^i \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{NH} \sum_{i=1}^{N} \sum_{t=1}^{H} \left( s_t^i - \hat{s}_t^i \right)^2},$$

where $N$ is the number of services, $H$ represents the number of prediction steps, and $s_t^i$ and $\hat{s}_t^i$ denote the ground-truth and predicted traffic volume, respectively, for service $i$ at timestamp $t$.

We first examine TMTPN's performance vis-a-vis that of the benchmarks considered, then investigate the influence that input/output lengths have on it. We first train and test different models for every district separately, using traffic solely observed within each of them.
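For concreteness, the two error metrics above can be computed as in the following minimal sketch, assuming NumPy arrays of shape (H, N) that hold ground-truth and predicted volumes; this is an illustration, not the authors' evaluation code:

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Mean absolute error over all prediction steps and services.
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Root mean square error over all prediction steps and services.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Example with H = 5 prediction steps and N = 20 services.
rng = np.random.default_rng(1)
y_true = rng.uniform(size=(5, 20))
y_pred = y_true + rng.normal(scale=0.05, size=(5, 20))
print(mae(y_true, y_pred), rmse(y_true, y_pred))
```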
We take input sequences of length 30 (i.e., 30 min of historical data) and predict the traffic volume per service over 5 future timestamps. The results obtained are summarized in Table 2, where lower MAE and RMSE values indicate superior prediction performance. Observe that the TMTPN models we propose consistently outperform the state-of-the-art neural networks considered. In particular, when compared with the second-best model, MTNet, our TMTPN reduces the MAE and RMSE on average by 18.74% and 18.49%, respectively, across the eight districts. This is because the multi-head attention structure adopted by our design allows the model to jointly extract information from different representation sub-spaces at different points in time, giving higher weights to the most significant historical patterns, which enhances prediction performance. The performance of LSTM and MTNet is relatively similar, while AttentionAR largely overestimates traffic demand, indicating that the traditional attention mechanism is not best suited to multi-service prediction tasks. Finally, GraphConv performs poorly in comparison with our TMTPN and the other two benchmarks we consider. This is likely because no strong spatial relationships exist between the different services in edge network settings.

To better appreciate the forecasting performance of our proposed TMTPN model at service level, in Table 3 we summarize it across two districts with dissimilar service usage patterns, namely D1 and D2, while in Figure 6 we illustrate 5-step prediction instances performed in the two districts across 5-hour windows (busy hours between 15:00 and 20:00 on 29 July 2020) for 4 randomly selected services (chat, web video, live streaming, P2P video). We only compare TMTPN with LSTM in this and the subsequent experiments, because LSTM is an effective deep learning model that achieves solid performance, as the results in Table 2 and prior work [6, 7] confirm. As can be seen from the figure, TMTPN is superior to LSTM, as it tracks the ground truth more closely. This is especially clear for 'Chat' traffic forecasting in D1 (sub-figure (a)) and 'Live Streaming' traffic in D2 (sub-figure (b)).

Table 3: Per-service prediction performance in terms of MAE (MB) in districts D1 and D2. Service names are given by index in Table 1.

In this subsection, we evaluate the long-term and short-term prediction performance of our TMTPN, focusing again on districts D1 and D2. We examine the MAE across all services as the forecasting horizon varies between 5 and 30 steps, while also varying the input size, i.e., 5, 15 and 30 historical traffic snapshots. The results obtained are summarized in Figure 7, where the x-axis represents the combination of input and output lengths (e.g., 30-5 indicates that the model uses the previous 30 minutes of traffic data to predict the upcoming 5 minutes of traffic demand). The top sub-figure corresponds to district D1 and the bottom one to district D2. Observe that TMTPN is consistently superior to LSTM, as it achieves lower prediction errors. The performance gains grow with the length of the forecasting horizon, with TMTPN reducing the MAE experienced with LSTM on long-term predictions by 43.21% and 40.77% in districts D1 and D2, respectively. Benefits are also observable short-term, where TMTPN attains 15.02% and 22.74% lower MAE than LSTM in the two districts, respectively, when the input and output lengths are both 5.
These gains can be attributed to the multi-head attention mechanism that our design adopts. In addition, the shifted input with look-ahead mask not only enables training parallelization, but also ensures that TMTPN can predict the future sequence on a rolling basis, unlike the LSTM, which predicts multiple future steps at once and is thus prone to larger errors. Lastly, we note that the input length has only a marginal impact on TMTPN's forecasting accuracy, with input size affecting performance slightly differently in the two districts examined. Yet in both cases the best performance is attained with 15 historical snapshots. Based on these results, we argue that if the input length is too short (5), the model may not be able to capture certain periodic information or longer trends.

Recall that the aim of service clustering in TransMUSE is to further improve forecasting performance by grouping services into different clusters according to their temporal similarity. Here, we demonstrate the benefits of using our WK-means algorithm for this task (hereafter denoted WASS), as compared to three benchmarks that can be applied to time series data, namely K-means clustering based on the Euclidean distance (EUC), dynamic time warping (DTW), and cosine similarity (COS).

Table 4: Silhouette score comparison for different numbers of clusters and the four clustering algorithms considered, in district D1.

Before comparing the clustering algorithms, the appropriate number of clusters needs to be determined. The silhouette score is routinely employed to characterize clustering performance; it is computed as the difference between the mean of the intra-cluster distances and the mean of the nearest-cluster distances, normalized by the maximum of the two [27]. The silhouette score lies in the [−1, 1] range, with larger values indicating higher-quality clustering. With this, we validate the effectiveness of our WK-means algorithm vis-a-vis that of the benchmarks considered, on the eight districts separately. For each district, $k$ is chosen in the {2, ..., 5} range, and we compute the silhouette score for each value. The results for district D1 are given in Table 4, which suggest $k = 2$ is the optimal value. The same holds for the vast majority of the other districts, with $k = 3$ yielding marginally higher silhouette scores (0.01 difference) in only 2 out of 32 instances. Hence we select $k = 2$ for all the remaining experiments.

Next, we evaluate the forecasting performance of TMTPN when a model is trained individually on each service cluster, following grouping of the 20 services into $k = 2$ clusters using the proposed WK-means and the benchmark algorithms. We resort again to MAE and RMSE for evaluation and summarize the results obtained in Table 5. The results demonstrate that all clustering algorithms reduce the prediction errors, which is more apparent in districts with larger traffic volumes, such as D1, D2, D3 and D8. Our WASS solution is superior to COS because cosine similarity prioritizes the direction of two vectors, as when assessing, e.g., the semantic similarity between two sentences. DTW and EUC are essentially based on the Euclidean distance between traffic magnitudes, whereas the "shape" of a time series is an important feature when measuring the similarity between two time series. Our WK-means algorithm (WASS), based on the Wasserstein distance, possesses this ability, which is reflected in the lower prediction errors obtained (bottom row in Table 5).
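For reference, the silhouette-driven selection of $k$ described above can be sketched as follows; the use of a precomputed pairwise Wasserstein distance matrix is our assumption about how WASS cluster quality would be assessed, not the authors' exact procedure:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.metrics import silhouette_score

def wk_silhouette(series: list[np.ndarray], labels: np.ndarray) -> float:
    """Silhouette score of a service clustering, under Wasserstein distance."""
    n = len(series)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = wasserstein_distance(series[i], series[j])
    # 'precomputed' tells scikit-learn to treat `dist` as a distance matrix.
    return silhouette_score(dist, labels, metric="precomputed")
```

Evaluating this score for each candidate $k$ and keeping the maximum reproduces the selection logic applied per district above.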
To better appreciate where the differences in the performance attained with WK-means and the three benchmarks stem from, in Figure 8 we visualize the cluster membership of the different services in a randomly chosen district (D2). DTW and EUC are based on the Euclidean distance between each sample and the cluster center, and only services with an extremely large traffic magnitude are categorized into the same cluster. Cosine similarity pays more attention to the difference between two vectors in direction rather than in distance or length. In our service clustering task, traffic magnitude is the primary consideration; therefore, cosine similarity is less effective in clustering service time series, which is also confirmed by our results reported in Table 5, where COS performs worse than the other three algorithms in all 8 districts.

Finally, we demonstrate the merits of model transfer in TransMUSE by showing that reusing models trained at reference nodes within a node cluster, with the aim of reducing computational overhead, does not negatively impact forecasting performance. Recall that the first step in transferring reference models is to decide the transfer scope. We use K-means clustering to group regions according to the statistical features of all the service traffic time series. We resort again to the silhouette score to determine the optimal number of region clusters, which we compute for $k \in \{2, 3, 4\}$ in Table 6. We conclude that $k = 2$ produces the highest score, and that districts D1, D2, D3, D5 and D8 should be grouped together, with the remaining three regions (D4, D6 and D7) forming the other edge node cluster.

Table 5: Clustering algorithm comparison based on the MAE and RMSE (MB) of forecasts obtained with TMTPN applied on service clusters across the 8 districts. Service indexes are sorted in descending order of traffic volume; service names can be obtained via the mapping in Table 1.

Table 6: Silhouette score of edge node clustering by K-means with different numbers of clusters k.

k:     2      3      4
score: 0.270  0.178  0.148

We order the regions within each cluster according to the overall traffic volume and conclude that D4 and D3 are to be selected as the reference nodes of their respective clusters. We quantify the generalization ability of the trained models by comparing the RMSE when forecasting after model transfer (TransMUSE) versus when models are trained locally at individual region level (Original). To add further perspective and verify our hypothesis that models trained at edge nodes witnessing large traffic volumes have stronger generalization abilities, we also examine the forecasting performance when the transferred models are trained in the regions where the traffic volume is the lowest among cluster members (Ctrl-Exp). The results are illustrated in Figure 9 for the two clusters, where regions appear in descending order of overall traffic volume. Observe that when the reference models are trained in the regions with the highest traffic demand (TransMUSE), the RMSE values are almost identical to those obtained when training models individually at each edge node. The largest performance gap is at the level of D6, where a 0.26% degradation in forecasting accuracy (RMSE) is observed. In contrast, if reference models were trained at the edge nodes with the lowest traffic volumes (D6 in cluster 1 and D5 in cluster 2), the forecasting performance would suffer (Ctrl-Exp). Specifically, the RMSE averaged over the 8 regions is 9.26 MB, which is 92 times larger than with TransMUSE.
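To recap the transfer-scope pipeline evaluated above, a minimal sketch of node clustering and reference-node selection follows; the feature construction mirrors the description in Section 3, while the array shapes and function names are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def transfer_scopes(traffic: np.ndarray, n_clusters: int = 2):
    """traffic: array of shape (nodes, services, time) with per-minute volumes."""
    # Per-service statistics (mean, std, max, min): one 80-dim vector per node
    # when there are 20 services, as in the dataset described above.
    feats = np.concatenate([traffic.mean(-1), traffic.std(-1),
                            traffic.max(-1), traffic.min(-1)], axis=1)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(feats)
    # Within each cluster, the node with the highest overall volume is the reference.
    totals = traffic.sum(axis=(1, 2))
    refs = {c: int(np.where(labels == c)[0][np.argmax(totals[labels == c])])
            for c in range(n_clusters)}
    return labels, refs
```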
We conclude that, as the number of edge nodes increases with the growing adoption of the MEC paradigm, our proposed model transfer strategy will help reduce training time and energy consumption. TransMUSE will only need to revisit cluster membership, and will circumvent the need to persistently collect traffic data in each district.

5. Related Work

Network traffic prediction is critical to network resource management, optimization and QoS improvement. While this topic has received a lot of attention over recent years, aspects including service-level traffic forecasting and prediction with low computational overhead have been largely overlooked. Here, we summarise the work most relevant to our contribution.

The main approaches to time sequence prediction are State Space Models (SSMs) and sequential models that frequently use deep learning (DL) [28]. SSMs include Auto-Regressive Integrated Moving Average (ARIMA) models and variants of these, which have been widely adopted for mobile traffic forecasting [29, 30, 31]. Their major drawback is that they require manual parameter selection on a sequence-by-sequence basis. In addition, they perform poorly when inputs exhibit high variability. DL has made advances in multiple domains, with Long Short-Term Memory (LSTM) models proven to be superior to traditional models such as ARIMA when predicting wired and wireless traffic [18, 32, 33, 34]. Given that spatial correlations exist between the traffic generated at different base stations in wireless networks, LSTM models have been combined with Convolutional Neural Networks (CNNs) to exploit them. Zhang et al. proposed a ConvLSTM model to predict multi-service mobile traffic [7], and a graph-sequence spatio-temporal model is introduced in [35] to forecast cellular traffic demand. More recently, attention and transformer architectures have demonstrated the ability to handle long sentences in the NLP domain, which subsequently led to their adoption in time series forecasting tasks [6, 28]. However, none of these prior works builds on the observation that spatial correlations are weak in wired networks and that correlations among services matter most.

As edge computing gains traction, several research projects have focused on cloud-edge model training based on collaborative learning. He et al. design a collaborative global-local learning scheme that leverages the generalization capability of a global model and the personalization ability of local models to boost the training performance of a graph attention spatio-temporal network (GASTN) for city-wide mobile traffic prediction [6]. Yan et al. propose COLLA, a collaborative learning framework that allows devices and the cloud to collectively learn user locations [36]. Zhang et al. design a collaborative cloud-edge computation method for driving behavior modeling, which trains and prunes common models in the cloud and conducts transfer learning at the edge [37]. Cartel is proposed in [38] for cloud-edge collaborative learning, aiming at distributing and updating machine learning models across geographically distributed edge clouds. These works are mostly premised on plenty of data existing in the cloud to train global models. Edge-edge collaboration, in scenarios where data is largely available only at the network edge, has received less attention. Further, the cost of data transfer has thus far been overlooked, which is non-negligible for network operators.
6. Conclusions

In this paper, we tackled network traffic prediction in multi-service edge networks with spatially heterogeneous demands. We proposed TransMUSE, a framework that groups edge nodes into cohorts and trains Transformer-based (TMTPN) models at reference locations, which can be transferred within cohorts without any adaptation. By means of extensive experiments with real-world data, we demonstrated that TransMUSE's forecasting performance is comparable with that of training individual models with local data at each node. We further proposed WK-means, a service clustering routine that reduces the number of TMTPN models to be maintained for forecasting, based on service similarities. Together, these facilitate accurate short- and long-term multi-service traffic prediction with reduced measurement and training costs, which is essential for fine-grained network management.

References

[1] Spider: Deep learning-driven sparse mobile traffic measurement collection and reconstruction.
[2] A memory-network based solution for multivariate time-series forecasting.
[3] A study of deep learning networks on mobile traffic forecasting.
[4] Long-term mobile traffic forecasting using deep spatio-temporal neural networks.
[5] Spatiotemporal analysis and prediction of cellular traffic in metropolis.
[6] Graph attention spatial-temporal network with collaborative global-local learning for citywide mobile traffic prediction.
[7] Multi-service mobile traffic forecasting via convolutional long short-term memories.
[8] Attention is all you need.
[9] Urban Vibes and Rural Charms: Analysis of Geographic Diversity in Mobile Service Usage at National Scale.
[10] Wasserstein distance guided representation learning for domain adaptation.
[11] Proceedings of the IEEE/CVF International Conference on Computer Vision.
[12] Wasserstein generative adversarial networks.
[13] Optimal transport and Wasserstein distance.
[14] Research on shape-based time series similarity measure.
[15] Pre-training of deep bidirectional transformers for language understanding.
[16] Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[17] Spatial-temporal transformer networks for traffic flow forecasting.
[18] Mobile traffic prediction from raw data using LSTM networks.
[19] Applying deep learning approaches for network traffic prediction.
[20] Time series forecasting with graph convolutional neural network.
[21] Semi-supervised classification with graph convolutional networks.
[22] Spektral for graph deep learning.
[23] Neural machine translation by jointly learning to align and translate.
[24] A first step towards distribution invariant regression metrics.
[25] Tree-RNN: Tree structural recurrent neural network for network traffic classification.
[26] Tslearn for the analysis of time series.
[27] Cluster quality analysis using silhouette score.
[28] Deep transformer models for time series forecasting: The influenza prevalence case.
[29] Traffic models in broadband networks.
[30] Wireless traffic modeling and prediction using seasonal ARIMA models.
[31] Call detail records driven anomaly detection and traffic prediction in mobile cellular networks.
[32] Multi-task learning at the mobile edge: An effective way to combine traffic classification and prediction.
[33] Mobile edge computing: A survey.
[34] Deep learning with long short-term memory for time series prediction.
[35] Mobile demand forecasting via deep graph-sequence spatio-temporal modeling in cellular networks.
[36] Collaborative learning between cloud and end devices: An empirical study on location prediction.
[37] Collaborative cloud-edge computation for personalized driving behavior modeling.
[38] Cartel: A system for collaborative transfer learning at the edge.