key: cord-0134259-bjcjhh7p
authors: Meybodi, Zohreh Hajiakhondi; Mohammadi, Arash; Rahimian, Elahe; Heidarian, Shahin; Abouei, Jamshid; Plataniotis, Konstantinos N.
title: TEDGE-Caching: Transformer-based Edge Caching Towards 6G Networks
date: 2021-12-01
journal: nan
DOI: nan
sha: b99232a88e81c6688f346cae79755f9721815e76
doc_id: 134259
cord_uid: bjcjhh7p

Abstract: As a consequence of the COVID-19 pandemic, the demand for telecommunication for remote learning/working and telemedicine has significantly increased. Mobile Edge Caching (MEC) in 6G networks has evolved as an efficient solution to meet the phenomenal growth of global mobile data traffic by bringing multimedia content closer to the users. Although the massive connectivity enabled by MEC networks will significantly increase the quality of communications, several key challenges lie ahead. The limited storage of edge nodes, the large size of multimedia content, and time-variant users' preferences make it critical to efficiently and dynamically predict the popularity of content, so that the most likely requested items can be stored before being requested. Recent advancements in Deep Neural Networks (DNNs) have drawn much research attention to predicting content popularity in proactive caching schemes. Existing DNN models in this context, however, suffer from long-term dependency issues, computational complexity, and unsuitability for parallel computing. To tackle these challenges, we propose an edge caching framework incorporating the attention-based Vision Transformer (ViT) neural network, referred to as Transformer-based Edge (TEDGE) caching, which, to the best of our knowledge, is studied here for the first time. Moreover, the TEDGE caching framework requires no data pre-processing and no additional contextual information. Simulation results corroborate the effectiveness of the proposed TEDGE caching framework in comparison to its counterparts.

This project was partially supported by the Department of National Defence's Innovation for Defence Excellence & Security (IDEaS), Canada.

I. Introduction

Mobile Edge Caching (MEC) [1], [2] is an emerging technology in the Beyond Fifth Generation (5G) communication networks (also referred to as 6G) developed to meet the phenomenal growth of global mobile data traffic. Enabling caching at the edge of the network provides the opportunity to store popular content in the storage of heterogeneous next generation Node Bs (hgNBs) during off-peak intervals [3], [4]. When a content is requested by an edge device (e.g., an Internet of Things (IoT) device), the request is served directly by a neighboring hgNB holding the requested content. In such scenarios, a cache-hit occurs; otherwise, a cache-miss occurs and the requested content is sent from the content server to the hgNB to serve the request [5], [6]. Integrating Unmanned Aerial Vehicles (UAVs) as flying hgNBs into terrestrial MEC networks [7] extends the service coverage and improves the Quality of Service (QoS) of User Equipment (UE) in Beyond 5G networks. Due to the limited local storage capacity of cache-enabled hgNBs, it is of significant importance to identify and store the most popular content to enhance the cache efficiency of the network. In MEC networks, there are two types of caching strategies, i.e., reactive caching and proactive caching. Conventional reactive caching schemes [8], such as the First-In-First-Out (FIFO), Least Recently Used (LRU), and Least Frequently Used (LFU) frameworks, identify the most popular content based on the underlying pattern of observed users' requests. A critical drawback of reactive caching is that popular content can only be identified after being requested; consequently, such schemes are not robust to the dynamically changing behavior of content popularity.
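For concreteness, the following minimal Python sketch (ours, not part of the original paper) illustrates how the LRU and LFU replacement decisions are made from the observed request stream; the class and variable names are hypothetical.

```python
from collections import OrderedDict, Counter

class LRUCache:
    """Least Recently Used: evict the content requested longest ago."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # keys ordered from least to most recently used

    def request(self, content_id):
        hit = content_id in self.store
        if hit:
            self.store.move_to_end(content_id)   # refresh recency on a cache-hit
        else:
            if len(self.store) >= self.capacity: # cache-miss with a full cache:
                self.store.popitem(last=False)   # evict the least recently used item
            self.store[content_id] = True
        return hit

class LFUCache:
    """Least Frequently Used: evict the content with the fewest requests so far."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.freq = Counter()
        self.store = set()

    def request(self, content_id):
        self.freq[content_id] += 1
        hit = content_id in self.store
        if not hit:
            if len(self.store) >= self.capacity:
                victim = min(self.store, key=lambda c: self.freq[c])
                self.store.remove(victim)        # evict the least frequently used item
            self.store.add(content_id)
        return hit

cache = LRUCache(capacity=2)
print([cache.request(c) for c in [1, 2, 1, 3, 2]])  # -> [False, False, True, False, False]
```

Both policies update the cache only after requests are observed; this is precisely the reactive limitation that the proactive, prediction-based schemes discussed next aim to remove.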
Therefore, the main focus of recent research has shifted to proactive caching, e.g., using Deep Neural Network (DNN) models to predict the Content Popularity (CP) from request patterns. In this context, popular content can be dynamically placed in the storage of hgNBs before being requested. This paper aims to further advance this emerging field.

Literature Review: Generally speaking, both temporal and spatial correlations exist within the time-variant request patterns of multimedia content. While spatial correlation reflects different users' preferences, depending on geographical location and users' contextual information, temporal correlation represents the variation of content popularity over time. In this context, several DNN models [9]-[17] have been introduced to capture the temporal and/or spatial features of user preferences in proactive caching schemes. For instance, Yu et al. [18] used an auto-encoder model to predict users' future preferences by learning a latent representation of the raw data in an unsupervised fashion. Auto-encoder models, however, suffer from training complexity. Tsai et al. [19] used a Convolutional Neural Network (CNN) to predict users' interests based on sentence analysis. Ndikumana et al. [20] introduced a DNN-based caching framework, comprising Multi-Layer Perceptron (MLP) and CNN models, where contextual information such as age, emotion, and gender is utilized for making caching decisions. Although CNN-based proactive caching schemes have local spatial feature awareness, they are inefficient at extracting temporal features from the patterns of sequential requests. Furthermore, such models require multi-source input, such as regional information and users' contextual information, to improve the cache performance, and therefore need an efficient data pre-processing model to extract this information. To deal with the time-varying behavior of request patterns, Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) networks [12], [21], have been introduced to exploit the historical information of the content. To extract both spatial and temporal features of CP data, Ale et al. [10] used a combination of LSTM and CNN models. LSTM-based caching frameworks, however, suffer from long-term dependencies, computational complexity, and unsuitability for parallel computing. To address the challenges associated with RNN architectures, the Transformer neural network [22] was designed to handle sequential input data while relying purely on attention mechanisms, with no recurrence or convolutions. One of the most important advantages of Transformers over RNN models is the attention mechanism, which eliminates the need to process data in order. Consequently, Transformers have higher parallelization capabilities than RNNs, implying reduced training time.

Contributions: Motivated by the above discussion, we introduce a Vision Transformer-based Edge (TEDGE) caching framework with application to MEC networks.
The proposed TEDGE framework learns the real-time caching strategy from sequential requests of multimedia content. The main objective of several recent time-series prediction models applied to multimedia content caching [23], [24] is to predict the underlying patterns of future multimedia content requests, i.e., the number of content requests, using historical information. Considering the fact that users' preferences remain unchanged for a while [14], it is sufficient to predict the potential Top-K popular content using the patterns learned from historical requests. The main focus of this study, therefore, is to predict the Top-K popular content using historical information instead of predicting the number of upcoming requests. In summary, the paper makes the following key contributions:

• The TEDGE caching framework is an edge-assisted intelligent caching framework that learns the caching strategy from historical request patterns without relying on data pre-processing or feature engineering. More precisely, the TEDGE caching framework is a multi-label classification model with the aim of minimizing the difference between the actual Top-K popular content and the predicted one.

• To simultaneously analyze the sequential patterns of all contents, the TEDGE caching framework employs a ViT architecture instead of conventional Transformer models. The input of the ViT model is an image, where each pixel indicates the number of requests of a content at a specific time.

Simulation results based on a real trace of multimedia requests illustrate that the proposed TEDGE caching framework outperforms its state-of-the-art counterparts in terms of cache-hit-ratio. The remainder of the paper is organized as follows: In Section II, the system model is described and the main assumptions required for the implementation of the proposed TEDGE framework are introduced. Section III presents the proposed TEDGE caching framework. Simulation results are presented in Section IV. Finally, Section V concludes the paper.

II. System Model

We consider a UAV-aided cellular network with heterogeneous Radio Access Technologies (RATs) as the 6G network model. There are N_h hgNBs, consisting of N_u UAVs, denoted by u_k, for (1 ≤ k ≤ N_u), together with terrestrial Femto Access Points (FAPs). The hgNBs are equipped with a limited cache size, denoted by K. As shown in Fig. 1, the FAPs are independently and randomly distributed in the environment following a Poisson Point Process (PPP) [25]. We also consider a Gaussian mixture distribution for the UEs, leading to a dense population in some areas. Since the population changes over time due to the movement of UEs, the locations of the UAVs are determined by the K-means clustering algorithm [26], where the UAVs remain hovering at their locations while serving a request for data delivery [2]. A Software Defined Network (SDN) controller is used to manage the aerial and terrestrial connections and to control the link quality and topology of the UAVs [2]. We denote a library of contents by C = {c_1, . . . , c_{N_c}}, where N_c = |C| is the number of contents in the network. For simplicity, it is assumed that the sizes of all contents c_l, for (1 ≤ l ≤ N_c), are the same [2], and that UEs request at most one content in each time slot. CP in multimedia services follows the Mandelbrot-Zipf (M-Zipf) distribution [27], where the global probability of requesting content c_l by all UEs, denoted by p_l, is given by

p_l = \frac{(r_l + \zeta)^{-\gamma}}{\sum_{j=1}^{N_c} (r_j + \zeta)^{-\gamma}},   (1)

where γ and ζ represent the skewness and plateau factors, respectively, and r_l is the rank of content c_l when all contents are sorted in descending order of their popularity. In addition, p_l^{(k)} represents the local probability of requesting content c_l by the IoT devices located in the coverage area of hgNB k.
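To illustrate Eq. (1), the short Python sketch below computes the M-Zipf probabilities and draws a synthetic request trace from them; the values of γ and ζ and all names are our own illustrative choices, not parameters reported in the paper.

```python
import numpy as np

def mzipf_probabilities(n_contents, gamma=0.8, zeta=5.0):
    """Mandelbrot-Zipf probabilities of Eq. (1); index 0 is the rank-1 (most popular) content."""
    ranks = np.arange(1, n_contents + 1)
    weights = (ranks + zeta) ** (-gamma)   # (r_l + zeta)^(-gamma)
    return weights / weights.sum()         # normalize over the content library

# Example: a library of 1,000 contents; sample one request per time slot.
rng = np.random.default_rng(0)
p = mzipf_probabilities(1000)
requests = rng.choice(1000, size=10_000, p=p)  # synthetic request trace (content indices)
print(p[:3], requests[:10])
```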
III. The Proposed TEDGE Caching Framework

In this section, we present the TEDGE caching framework, which is designed to predict the Top-K popular content. To be specific, we first briefly introduce the dataset used in this study and present the preparation phase that adapts the dataset to the TEDGE caching framework. Then, we explain the different blocks of the ViT architecture, which is used as the multi-label classifier within the TEDGE caching framework.

In this study, we use the MovieLens dataset [28], one of the well-known movie recommendation services. In this dataset, movies are provided with related information such as movie title, release date, and genre. Each content is requested by several users at different timestamps, where the contextual information of users, such as age, gender, occupation, and ZIP code, is also released. With the assumption that users leave a comment after watching a movie [12], [29], [30], and in order to extract the content request pattern, commenting on a content is considered a request. Moreover, to identify the users' locations at each timestamp, ZIP codes are converted to longitude and latitude coordinates [12]. Given the limited transmission range of hgNBs, the hgNBs' locations, and the users' locations, the hgNBs available for serving the requests of all users are determined. Our main goal in the TEDGE caching framework is to monitor the historical request pattern of each content in order to predict the Top-K popular content in an upcoming time period. The preparation of the dataset is, therefore, performed in the following four steps:

Step 1 (Request Matrix Formation): In the first step, the dataset is sorted for each content c_l, for (1 ≤ l ≤ N_c), in ascending order of time. We then form a (T × N_c) indicator request matrix for each hgNB, denoted by R, where T and N_c represent the total number of timestamps and the total number of distinct contents, respectively. In the request matrix, r_{t,l} = 1 indicates that content c_l is requested at time t; otherwise, r_{t,l} = 0.

Step 2 (Time Windowing): Considering the fact that the most popular content should be cached in the storage of hgNBs during off-peak times [3], there is no need to predict the content popularity at every timestamp. We, therefore, define the updating time t_u (i.e., the off-peak time) as the timestamp at which the storage of hgNBs is updated with the new popular content. In this case, we have a time window of length W, where W corresponds to the time duration between two updating times, and the number of time windows is N_W = T/W. Therefore, we obtain an (N_W × N_c) window-based request matrix, denoted by R^{(W)}, whose entry r^{(W)}_{w,l} aggregates the requests for content c_l within the w-th window.
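The two steps above can be expressed compactly; the following sketch (illustrative only, with hypothetical helper names) builds the indicator matrix R from a list of (timestamp, content) events and aggregates it into the window-based matrix R^{(W)}.

```python
import numpy as np

def build_request_matrix(events, n_timestamps, n_contents):
    """Step 1: R[t, l] = 1 iff content l is requested at time t."""
    R = np.zeros((n_timestamps, n_contents), dtype=np.int8)
    for t, l in events:                      # events: iterable of (timestamp, content index)
        R[t, l] = 1
    return R

def window_request_matrix(R, W):
    """Step 2: aggregate R into N_W = T // W windows; entry (w, l) counts
    the requests for content l between two consecutive updating times."""
    T, n_contents = R.shape
    n_windows = T // W                       # drop a trailing partial window, if any
    return R[: n_windows * W].reshape(n_windows, W, n_contents).sum(axis=1)

# Toy example: 3 contents observed over 6 timestamps, window length W = 3.
events = [(0, 0), (1, 2), (2, 0), (3, 1), (4, 0), (5, 0)]
R = build_request_matrix(events, n_timestamps=6, n_contents=3)
print(window_request_matrix(R, W=3))  # -> [[2 0 1]
                                      #     [2 1 0]]
```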
Step 3 (Data Segmentation): As mentioned previously, the main target of the TEDGE caching framework is to use the historical information of content to predict the Top-K popular content at the next updating time. Given the window-based request matrix R^{(W)}, the collected request pattern data is segmented via an overlapping sliding window of length l. As can be seen from Fig. 2, the window-based request matrix R^{(W)} is converted into D = {(X_u, y_u)}_{u=1}^{M}, where M represents the total number of segments. Moreover, X_u ∈ R^{l × N_c} and y_u ∈ R^{N_c × 1} represent the request pattern of all contents over the l windows preceding updating time t_u, and its corresponding label, respectively. Considering the fact that there are N_c contents in the network and that our objective is to predict the Top-K popular content at the next updating time, the problem at hand is a multi-label classification, where y_{u,l} = 1 indicates that content c_l will be popular at t_{u+1}; content c_l should, therefore, be stored at the hgNB to increase the cache-hit-ratio.

Step 4 (Data Labeling): Due to the limited storage of hgNBs, it is sufficient to identify the Top-K popular content, instead of predicting the popularity of all contents at each updating time. According to the request pattern of multimedia content, we calculate the probability of requesting content c_l, for (1 ≤ l ≤ N_c), as

\hat{p}_l = \frac{\sum_{w=1}^{N_W} r_{w,l}^{(W)}}{\sum_{j=1}^{N_c} \sum_{w=1}^{N_W} r_{w,j}^{(W)}}.   (2)

Note that relying on this probability as the single criterion for identifying popular content has the following disadvantages: (i) popular content with a high number of requests will be identified as Top-K popular content for a long time, even while becoming unpopular, and; (ii) the popularity of newly arriving or unknown content (first appearance) would be predicted with considerable delay, because the cumulative number of requests of such content is lower than that of content that has existed for a long time. To tackle this issue, we use the skewness of the request pattern as a second metric, which is a widely used indicator in time-series forecasting models [31]. The skewness of content c_l is denoted by ζ_l, where ζ_l < 0 indicates that the number of requests of content c_l increases over time. Finally, the Top-K contents with the highest probability and a negative skewness are labeled as the Top-K popular content. This completes the presentation of the data preparation for training the TEDGE caching framework.
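As a rough sketch of the labeling rule of Steps 3-4, the snippet below marks the Top-K contents by Eq. (2) probability among those with negative skewness; scipy.stats.skew stands in for the paper's unspecified skewness estimator, and the tie-breaking details are our own assumptions.

```python
import numpy as np
from scipy.stats import skew

def label_top_k(R_w, K):
    """Step 4: label the Top-K contents with the highest empirical request
    probability (Eq. (2)) among those whose request time-series is negatively
    skewed, i.e., whose number of requests is still growing over time."""
    probs = R_w.sum(axis=0) / R_w.sum()   # empirical request probability per content
    skews = skew(R_w, axis=0)             # skewness of each content's request pattern
    eligible = np.where(skews < 0)[0]     # constant patterns yield nan and are excluded
    top = eligible[np.argsort(probs[eligible])[::-1][:K]]
    y = np.zeros(R_w.shape[1], dtype=np.int8)  # if fewer than K qualify, fewer labels are set
    y[top] = 1                            # multi-label target vector y_u
    return y
```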
Next, we present the ViT architecture. Generally speaking, the main characteristics of Transformers are as follows: (i) Non-sequential processing: unlike RNNs, the Transformer's attention mechanism makes it unnecessary to process data in order; as a result, Transformers are more parallelizable than RNNs, which means they take less time to train; (ii) Self-Attention (SA), which computes similarity scores between the different elements of a sequential input, and; (iii) Positional embeddings: since Transformers are non-sequential learning models, the order information of a sequential input would otherwise be lost; positional embeddings are introduced to recover this position information.

As can be seen from Fig. 2, the TEDGE caching framework consists of the following three modules: (a) patch and position embeddings; (b) the Transformer encoder, and; (c) the MLP head, which are described as follows.

Patch and Position Embeddings: As can be seen from Fig. 2, the segmented CP data X_u is split into a sequence of N non-overlapping patches of fixed size (S × S), where, for an input of size (w × h), the total number of patches is N = wh/S^2. After this step, each patch is flattened into a vector x^p_{u,j} ∈ R^{S^2}, for (1 ≤ j ≤ N). To embed the vectors x^p_{u,j} ∈ R^{S^2} into the model dimension d, a linear projection E ∈ R^{S^2 × d} is used, which is shared among all patches; the output of this projection is referred to as the patch embeddings. We append a learnable embedding token x_{cls} to the beginning of the sequence of embedded patches [32]. Finally, the position embeddings E_{pos} ∈ R^{(N+1) × d} are added to the patch embeddings to explicitly encode the order of the input sequence. The output of the patch and position embedding module, Z_0, is given by

Z_0 = [x_{cls}; x_{u,1}^p E; x_{u,2}^p E; \ldots; x_{u,N}^p E] + E_{pos}.   (3)

Transformer Encoder: Given the output of the linear projection, the sequence of vectors Z_0 is fed to the Transformer encoder [22]. As can be seen from Fig. 2, the Transformer encoder consists of L layers with two modules each, i.e., the Multi-head Self-Attention (MSA) module and the MLP module, where the MLP module consists of two linear layers with the Gaussian Error Linear Unit (GELU) activation function. The outputs of the MSA and MLP modules of layer l, for (1 ≤ l ≤ L), are given by

Z'_l = \mathrm{MSA}(\mathrm{LayerNorm}(Z_{l-1})) + Z_{l-1},   (4)
Z_l = \mathrm{MLP}(\mathrm{LayerNorm}(Z'_l)) + Z'_l,   (5)

where layer normalization is used to avoid the degradation problem [33]. Finally, the output of the Transformer encoder is Z_L, whose first row z_{L,0}, corresponding to the classification token, is used for classification purposes and is passed to a Linear Layer (LL), i.e., y = LL(LayerNorm(z_{L,0})). This completes the description of the Transformer encoder. Next, we describe the SA and MSA modules, respectively.

The SA module [22] is used in the Transformer architecture to focus on the significant parts of a given input by capturing the interactions between the different vectors of Z ∈ R^{N × d}, where Z consists of N vectors, each with embedding dimension d. Toward this goal, three matrices, named queries Q, keys K, and values V, are computed by a linear transformation as follows:

[Q, K, V] = Z W_{QKV},   (6)

where W_{QKV} ∈ R^{d × 3d_h} represents a trainable weight matrix and d_h is the dimension of Q, K, and V. The SA block measures the pairwise similarity between each query and all keys. The output of the SA block, SA(Z) ∈ R^{N × d_h}, which is a weighted sum over all values V, is given by

\mathrm{SA}(Z) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_h}}\right) V,   (7)

where QK^T/\sqrt{d_h} is the dot-product of Q and K scaled by \sqrt{d_h}, and the softmax converts the scaled similarities into probabilities. Finally, the output of the MSA module is given by

\mathrm{MSA}(Z) = [\mathrm{SA}_1(Z); \mathrm{SA}_2(Z); \ldots; \mathrm{SA}_h(Z)] W_{MSA},   (8)

where W_{MSA} ∈ R^{h d_h × d} and d_h is set to d/h.
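For concreteness, a minimal PyTorch sketch of the MSA module of Eqs. (6)-(8) follows; this is our own illustrative implementation (in practice one would typically rely on an existing ViT implementation), not the authors' code.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal MSA module implementing Eqs. (6)-(8): h parallel self-attention
    heads with d_h = d / h, followed by the output projection W_MSA."""
    def __init__(self, d, h):
        super().__init__()
        assert d % h == 0, "model dimension d must be divisible by the number of heads h"
        self.h, self.d_h = h, d // h
        self.qkv = nn.Linear(d, 3 * d, bias=False)   # W_QKV for all heads at once, Eq. (6)
        self.proj = nn.Linear(d, d, bias=False)      # W_MSA, Eq. (8)

    def forward(self, Z):                            # Z: (batch, N, d)
        B, N, d = Z.shape
        qkv = self.qkv(Z).reshape(B, N, 3, self.h, self.d_h).permute(2, 0, 3, 1, 4)
        Q, K, V = qkv[0], qkv[1], qkv[2]             # each: (batch, h, N, d_h)
        scores = Q @ K.transpose(-2, -1) / self.d_h ** 0.5  # scaled dot-product, Eq. (7)
        A = scores.softmax(dim=-1)                   # scaled similarity -> probability
        out = (A @ V).transpose(1, 2).reshape(B, N, d)      # concatenate the h heads
        return self.proj(out)

# Example: N + 1 = 26 embedded patches (including the class token), d = 64, h = 8.
Z = torch.randn(2, 26, 64)
print(MultiHeadSelfAttention(d=64, h=8)(Z).shape)   # -> torch.Size([2, 26, 64])
```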
IV. Simulation Results

In this section, we first evaluate different variants of the proposed TEDGE caching framework to obtain the best architecture through a process of trial and error. Based on the UE locations obtained from the ZIP codes, and following Reference [20], six hgNBs are deployed in different areas, and the classification accuracy is averaged over all hgNBs. In all experiments, the one-dimensional time-series of content requests is converted to a sequential set of images using the Gramian Angular Field (GAF) technique [34]. Utilizing the GAF method, not only are the temporal characteristics of the data preserved, but the temporal correlations of the data are also captured. For Back-Propagation (BP) training, the Adam optimizer is employed, where the weight decay and betas are set to 0.001 and (0.9, 0.999), respectively. The size of the input image, the size of the input patches, and the batch size are (25 × 25), (5 × 5), and 256, respectively. Finally, we use binary cross-entropy as the loss function for our multi-label classification problem.

According to the results in Table I, increasing the model dimension from 32 to 128 (Model 1 to Model 3), as well as the number of MLP layers from 1 to 3 (Models 3, 5, and 6), increases the classification accuracy, at the cost of a larger number of trainable parameters. Moreover, we evaluate the effect of the MLP size on the classification accuracy. As can be seen from Table I, increasing the MLP size from 256 to 512 (Model 3 and Model 9) decreases the classification accuracy. Similarly, there is no improvement in the classification accuracy when the number of Transformer layers is increased (see Models 3 and 4). Furthermore, comparing Models 6 to 8, the best number of heads in this architecture is 8. Note that we also evaluated the effect of the window length on the classification accuracy; no improvement was achieved by changing the window length.

Finally, we compare the performance of the proposed TEDGE caching framework with five state-of-the-art caching schemes on the MovieLens dataset: LRU, LFU, PopCaching [30], LSTM-C [12], and TRansformer (TR) caching, an upgraded version of the attention-based neural network of Reference [35]. While the attention-based model in Reference [35] is used to predict the request patterns of online content, we adapt it to predict the Top-K popular ones. Fig. 3 compares the performance of the proposed TEDGE scheme with the aforementioned baselines in terms of the cache-hit ratio, once the DNN models reach the steady state. In the content caching context, the cache-hit-ratio is a widely used metric, defined as the ratio of the requests served by hgNBs to the total number of requests. Considering the Zipf distribution of the content popularity profile, we set the storage capacity of the hgNBs to 10% of the total content [5]. As shown in Fig. 3, the optimal strategy [12] is a caching scheme in which all requests are served by hgNBs, which cannot be attained in reality. According to the results in Fig. 3, the proposed TEDGE caching framework achieves the highest cache-hit-ratio in comparison to its state-of-the-art counterparts.
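The cache-hit-ratio reported in Fig. 3 can be computed by replaying the request trace against the per-window cached sets, as in the sketch below (illustrative; the replay loop and names are ours).

```python
def cache_hit_ratio(requests, cached_per_window, W):
    """Fraction of requests served from the hgNB cache: a request at time t is a
    hit iff the requested content is in the set cached for the current window."""
    hits = 0
    for t, content_id in enumerate(requests):
        if content_id in cached_per_window[t // W]:   # cache is updated every W slots
            hits += 1
    return hits / len(requests)

# Toy example: two windows of length W = 3, caching the predicted Top-1 content.
requests = [0, 2, 0, 1, 0, 0]
cached_per_window = [{0}, {0}]          # e.g., the Top-K sets predicted by TEDGE
print(cache_hit_ratio(requests, cached_per_window, W=3))  # -> 0.666...
```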
V. Conclusion

In this paper, we presented a Transformer-based Edge (TEDGE) caching framework with application to Mobile Edge Caching (MEC) networks. To efficiently learn the real-time caching strategy from the time-series request patterns of multimedia content, we employed a Vision Transformer (ViT) architecture. To the best of our knowledge, this is the first time that a ViT architecture has been used in MEC networks to increase the cache-hit-ratio by simultaneously identifying the Top-K popular content with high accuracy. Simulation results showed that the proposed TEDGE caching scheme improves the cache-hit ratio when compared to its state-of-the-art counterparts.

References

[1] Edge Cloud Server Deployment with Transmission Power Control through Machine Learning for 6G Internet of Things.
[2] HCP: Heterogeneous Computing Platform for Federated Learning Based Collaborative Content Caching Towards 6G Networks.
[3] Caching at the edge in high energy-efficient wireless access networks.
[4] Joint Cache Placement and Bandwidth Allocation for FDMA-based Mobile Edge Computing Systems.
[5] Cache Replacement Schemes Based on Adaptive Time Window for Video on Demand Services in Femtocell Networks.
[6] Mobility-Aware Femtocaching Algorithm in D2D Networks Based on Handover.
[7] Joint Transmission Scheme and Coded Content Placement in Cluster-centric UAV-aided Cellular Networks.
[8] Spatial multi-LRU caching for wireless networks with coverage overlaps.
[9] Content-Aware Proactive Caching for Backhaul Offloading in Cellular Network.
[10] Online Proactive Caching in Mobile Edge Computing Using Bidirectional Deep Recurrent Neural Network.
[11] PA-Cache: Evolving Learning-Based Popularity-Aware Content Caching in Edge Networks.
[12] Toward Edge-Assisted Video Content Intelligent Caching With Long Short-Term Memory Learning.
[13] DeepCachNet: A Proactive Caching Framework Based on Deep Learning in Cellular Networks.
[14] Video Popularity Prediction: An Autoencoder Approach With Clustering.
[15] Deep Reinforcement Learning-Based Edge Caching in Wireless Networks.
[16] Dynamic Content Update for Wireless Edge Caching via Deep Reinforcement Learning.
[17] DeepChunk: Deep Q-Learning for Chunk-Based Caching in Wireless Data Processing Networks.
[18] Mobility-Aware Proactive Edge Caching for Connected Vehicles Using Federated Learning.
[19] Mobile Social Media Networks Caching with Convolutional Neural Network.
[20] Deep Learning Based Caching for Self-Driving Cars in Multi-Access Edge Computing.
[21] LSTM for Mobility Based Content Popularity Prediction in Wireless Caching Networks.
[22] Attention is all you need.
[23] Towards Hit-Interruption Trade off in Vehicular Edge Caching: Algorithm and Analysis.
[24] Deep Learning for Wireless Coded Caching With Unknown and Time-Variant Content Popularity.
[25] Cooperative Caching and Transmission Design in Cluster-Centric Small Cell Networks.
[26] Deep Reinforcement Learning for Trustworthy and Time-Varying Connection Scheduling in a Coupled UAV-Based Femtocaching Architecture.
[27] Federated Deep Reinforcement Learning for Internet of Things With Decentralized Cooperative Edge Caching.
[28] The MovieLens Datasets: History and Context.
[29] Cache Content-Selection Policies for Streaming Video Services.
[30] Popularity-Driven Content Caching.
[31] Forecasting Crashes: Trading Volume, Past Returns, and Conditional Skewness in Stock Prices.
[32] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[33] Layer normalization.
[34] Day-Ahead Solar Irradiation Forecasting Utilizing Gramian Angular Field and Convolutional Long Short-Term Memory.
[35] Attention-Based Neural Network: A Novel Approach for Predicting the Popularity of Online Content.