1 Introduction

With the advances in GPS devices and smart city infrastructure, more and more mobility data is being generated [23]. Vehicles, smartwatches, traffic signals, and mobile phones, to name only a few, generate a massive volume of trajectory data. These data offer an unprecedented opportunity to discover rich information about traffic behavior, people’s mobility, and weather impact [3, 21].

Anomaly detection is the task of finding observations that stand out as dissimilar to all others [4]. Particularly in the traffic context, which is easily influenced by external factors (e.g., accidents, detours, events, and weather conditions), trajectory anomaly detection is crucial to understand traffic behavior and to support better decision-making by transit authorities.

Although trajectory anomaly detection has been a research hotspot [1], some challenges remain. For example, most solutions rely on handcrafted features and physical trajectory characteristics such as density [8], distance [9], and isolation [5]. Such features can be costly when dealing, for instance, with a high volume of trajectories, or might not work well in the case of data sparseness [20], which occurs when the period between a trajectory’s consecutive reported points is too large. In addition, little attention has been paid to finding anomalous regions in an online way [5], which can benefit real applications, as the following examples show:

Example 1:

A car ridesharing company might store thousands of daily trajectories to suggest the best routes to its riders and drivers. Therefore, finding anomalous trajectories is an essential feature, since it can help recommend alternative paths, free of anomalies such as traffic congestion, and alert riders or penalize drivers who do not follow the paths recommended by its app.

Example 2:

Public transportation agencies in large cities worldwide face traffic problems such as congestion, accidents, and poor weather conditions. At the same time, these agencies need to serve the population by making real-time decisions to deal with these issues, such as proposing alternative routes or releasing more buses. Given that, an online trajectory anomaly detection approach can identify problematic bus trips, allowing transit authorities to intervene as quickly as possible.

In this paper, we propose an approach that uses language modeling, commonly used in Natural Language Processing tasks, to: (1) detect anomalous trajectories in bus trajectories and (2) pinpoint the abnormal points in these trajectories (sub-trajectory anomaly detection). Our solution allows these tasks to be performed either offline or online, i.e., as the buses move along their route. In addition, it can also easily be adapted to other types of trajectories (e.g., cars, people, and vessels) since our model does not use any specific aspect from the bus domain (e.g., bus stops).

The key idea of our model is to learn the language of well-formed trajectories and then identify erroneous (ill-formed) trajectories and the trajectories’ points where the errors occur. For that, given an input trajectory T mapped by our solution to a sequence of tokens, our language model (LM) generates \(\hat{T}\), the most likely (common) sequence of points for T, which represents T supposedly without anomalies. Our assumption is, therefore, that since anomalous points are typically few and different from the others [5], there is a small chance that abnormal points are present in \(\hat{T}\). Based on that, our solution produces an anomaly score for T and pinpoints anomalous regions in T by comparing T with \(\hat{T}\).

We build our language model using a deep generative encoder-decoder Transformer [16] to learn the relationships between the sequential points in trajectories. More specifically, our solution first maps the input raw trajectory’s points into a geographical grid and uses each grid cell’s id as a token to represent the trajectory points. Then, this token-based trajectory representation feeds the Transformer encoder, which applies the self-attention mechanism to relate tokens in the sequence. The decoder receives the encoder’s output and leverages previously predicted tokens to generate the next one based on the self-attention mechanism.

We have conducted an extensive experimental evaluation on real-world bus datasets from Recife (Brazil) and Dublin (Ireland) cities. The results show that our language model effectively detects whether trajectories are anomalous and, at the same time, finds anomalous regions without any handcrafted features.

The rest of this paper is organized as follows. Section 2 reviews the state of the art in GPS trajectory anomaly detection. In Sect. 3, we present some background concepts and the problem statement. Section 4 delineates our model, and Sect. 5 describes the datasets and the experimental setup. We compare our results with state-of-the-art algorithms and previous research in Sect. 6. Finally, conclusions and future work are drawn in Sect. 7.

2 Related Work

A recent survey [1] provides a comprehensive summary of the state-of-the-art solutions in trajectory anomaly detection. Most of them are based on distance [9] and pattern mining [5]. In addition, some approaches use machine learning in a supervised [6] or semi-supervised [13] way. In this section, we discuss some of these approaches in detail.

iBOAT (Isolation-based Online Anomalous Trajectory Detection) [5] detects anomalies point-by-point by isolating trajectories that are “few” and “different” from historical trajectories on the same route. It uses a “support” function to count how many historical trajectories share points with an ongoing trajectory (called a window). If the support value falls below a certain threshold, the points in the window are marked as anomalous. An anomaly score is then assigned using a logistic function, giving high values to anomalous points and low values to non-anomalous ones. iBOAT operates online, generating anomaly scores and detecting abnormal points in real-time, but it requires trajectories with the same departure and destination points to function.

In GMVSAE [13], the authors propose a deep learning encoder-decoder approach to detect whether a trajectory is anomalous. The approach uses an LSTM (Long Short-Term Memory) architecture to encode and decode trajectories. The encoder models trajectories with a Gaussian mixture distribution. Then, the proposed method uses each distribution as the decoder’s input to calculate the probability of an ongoing trajectory belonging to one of the distributions. The trajectory is anomalous if the probability is below a certain threshold.

STOD (Spatial-Temporal Outlier Detector) [6] uses a supervised deep-learning model to detect trajectory anomalies in bus routes. It classifies bus trajectories based on predefined routes and considers them anomalous if the model’s confidence in the classification is low. Each trajectory point is represented with features like GPS timestamp and a pre-trained embedding vector, which are fed into a Bi-GRU (Gated Recurrent Unit) architecture followed by a Multi-Layer Perceptron (MLP) with a softmax function. STOD calculates entropy over the softmax output’s probability distribution to determine prediction confidence. A trajectory is marked as anomalous if the confidence score falls below a threshold. Similar to STOD, [2] proposes a multi-class Convolutional Neural Network to classify trajectories by route IDs and to identify anomalies based on misclassification or low-class probability. Both approaches require route ID labels, do not perform online anomaly detection, and do not detect anomalous regions.

Lastly, CTSS (Continuous Trajectory Similarity Search) [19] presents an online trajectory search method based on similarity scores to detect anomalous trajectories. For this, the authors proposed an approach that considers the current point of an input trajectory and all possible paths to arrive at the trajectory destination. Given all those future trajectories and a reference trajectory (ground truth), the approach calculates similarity scores between each pair (future trajectory and the reference one) and returns the one with minimum distance. Then, if the score is greater than a threshold, the input trajectory is anomalous. CTSS needs ground truth trajectories to calculate the similarity scores and knowledge of the trajectory destination in advance.

3 Problem Formulation

In this section, we provide some background concepts and state the problem we deal with in this work.

Definition 1

[Bus Trajectory]. We define a bus trajectory T as a sequence of consecutive GPS points collected from a bus trip, denoted as \(T=\{p_1,p_2,...,p_n\}\), where n is the length of the trajectory. Each point \(p_i=\{lat_i,lng_i\}\in \mathbb {R}^{2}\) is composed of latitude (\(lat_i\)) and longitude (\(lng_i\)) and is associated with a bus trip, and the points are ordered by their timestamps, i.e., \(tsp_i < tsp_{i+1}\).

Definition 2

[Spatial Anomaly Trajectory]. We consider a trajectory T as anomalous if some of its points spatially diverge from regular trajectories of T’s assigned bus route.

Definition 3

[Problem Statement]. Given a bus trajectory T, we aim to calculate the spatial anomaly score of T and detect the anomalous points of T.

4 Method

In this section, we introduce our approach to discovering abnormal trajectories and localizing anomaly regions in trajectories. As Fig. 1 shows, our solution is composed of three main components: Grid Mapping, Transformer Language Model, and Anomaly Detector. They work as follows. Given an input bus trajectory, the Grid Mapping discretizes it by mapping each of its points to a geographical grid cell, represented by a token, generating a sequence of grid cell tokens. This sequence is passed to a language model, a deep generative encoder-decoder Transformer, that produces a series of predicted grid cell tokens. Finally, the Anomaly Detector compares the original token sequence with the predicted one to calculate the trajectory’s anomaly score and identify anomalous points if they exist. In the remainder of this section, we provide further details about each one of these components.

Fig. 1.
figure 1

The trajectory anomaly detection solution proposed in this work.

4.1 Grid Mapping

The first step of our approach, Grid Mapping, maps the trajectories, which are multivariate time series, into a univariate sequence of tokens. For this purpose, we use the H3 (Hexagonal Hierarchical Geospatial Indexing System)Footnote 1 library to create a grid system based on hexagonal cells and a hierarchical index. More specifically, given a raw trajectory \(T=(p_1, p_2,..,p_n)\), where each trajectory point \(p_i\) is represented by its latitude and longitude (lat, lng), and a grid of cells G, we use the geoToH3 function in the H3 library to map the trajectory points into grid cell locations \(c_i\), generating \(T'=(c_1, c_2,...,c_n)\). As a result, every point that falls into the same cell has the same identifier (token). Therefore, this mapping reduces the complexity of dealing with a continuous multi-dimensional domain to a discrete uni-dimensional one, which language models can adequately process.
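As a dependency-free illustration of this discretization step (the actual solution uses H3’s hexagonal cells via geoToH3), the sketch below maps points onto a plain square grid with a hypothetical cell size; the essential property is the same: nearby points collapse to the same token.

```python
# Simplified sketch of the Grid Mapping step using a square grid instead of
# H3 hexagons. The cell size (in degrees) is an illustrative assumption.

def to_cell_token(lat: float, lng: float, cell_deg: float = 0.01) -> str:
    """Map a GPS point to a discrete grid-cell token."""
    row = int(lat // cell_deg)
    col = int(lng // cell_deg)
    return f"cell_{row}_{col}"

def trajectory_to_tokens(points, cell_deg: float = 0.01):
    """Discretize a raw trajectory [(lat, lng), ...] into a token sequence."""
    return [to_cell_token(lat, lng, cell_deg) for lat, lng in points]

# Two nearby points fall into the same cell and thus share a token.
traj = [(-8.0476, -34.8770), (-8.0477, -34.8771), (-8.0600, -34.8900)]
tokens = trajectory_to_tokens(traj)
```

With H3 the call would instead return a hexagonal cell index, but the resulting token sequence plays the same role as input to the language model.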

4.2 Transformer Encoder

The first component of our language model is the Transformer Encoder which maps the sequence of trajectory tokens \(T'\) into a set of vectors that feeds the Transformer Decoder. For that, it encodes the tokens in \(T'\) based on the other tokens (points) in \(T'\) by applying the self-attention strategy.

More concretely, as Fig. 1 depicts, the encoder first generates a sequence of embeddings from the trajectory tokens. An embedding is a vector representation of a token in an n-dimension spaceFootnote 2. Next, a positional encoding adds position information to the embeddings. Similar to [16], our position encoder is calculated as:

$$\begin{aligned} PE(pos,2i) = sin(pos/10000^{2i/d_{model}}) \end{aligned}$$
(1)
$$\begin{aligned} PE(pos,2i+1) = cos(pos/10000^{2i/d_{model}}) \end{aligned}$$
(2)

where sin and cos are the sine and cosine functions, respectively, pos is the position of the point in the trajectory, i indexes the embedding dimensions, and \(d_{model}\) is the dimension of the embedding vector. To create the final embedding representation for each input token, the model performs an element-wise addition of the token embedding and the positional encoding vector.
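Equations (1) and (2) can be computed directly; the sketch below builds the sinusoidal encoding for a short sequence in pure Python, assuming the standard base of 10000 from [16]:

```python
import math

def positional_encoding(seq_len: int, d_model: int):
    """Sinusoidal positional encoding as in Eqs. (1)-(2): even dimensions
    use sine, odd dimensions use cosine, with frequencies decaying in i."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# pe[0] alternates sin(0) = 0 and cos(0) = 1 across dimensions.
```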

The model then passes these embeddings to the Transformer block with four identical encoder layers. It uses the so-called multi-head self-attention to allow the network to attend to different input sequence positions and learn which points in the sequence are relevant to the current one. Multiple heads create multiple representation subspaces by learning weight matrices for queries Q and keys K of dimension \(d_k\), and for values V of dimension \(d_v\). Each head computes the attention weights for a given token embedding j (\(Embedding_j\)) as follows:

$$\begin{aligned} Q_{i,j} = W^{Q}_i \cdot Embedding_j \end{aligned}$$
(3)
$$\begin{aligned} K_{i,j} = W^{K}_i \cdot Embedding_j \end{aligned}$$
(4)
$$\begin{aligned} V_{i,j} = W^{V}_i \cdot Embedding_j \end{aligned}$$
(5)
$$\begin{aligned} Head_i = softmax(\dfrac{Q_iK_i^\top }{\sqrt{d_k}})V_i \end{aligned}$$
(6)

with parameter matrices \(W^Q \in \mathbb {R}^{d_{model}\times d_k}\), \(W^K \in \mathbb {R}^{d_{model}\times d_k}\), and \(W^V \in \mathbb {R}^{d_{model}\times d_v}\).

The multi-head attention combines the individual heads as follows:

$$\begin{aligned} Multi\_Head=concat(Head_1, Head_2,...,Head_n)W^O \end{aligned}$$
(7)

where \(W^O \in \mathbb {R}^{hd_{v}\times d_{model}}\) is a weight matrix learned during training.
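A minimal sketch of a single attention head (Eqs. (3)-(6)), with Q, K, and V given directly rather than produced by the learned projection matrices \(W^Q\), \(W^K\), \(W^V\):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def scaled_dot_product_attention(Q, K, V):
    """One attention head, Eq. (6): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    Kt = [list(col) for col in zip(*K)]                       # K transposed
    scores = matmul(Q, Kt)                                    # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row])     # row-wise softmax
               for row in scores]
    return matmul(weights, V)                                 # weighted values

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = scaled_dot_product_attention(Q, K, V)
# Each output row is a convex combination of the value rows.
```

In the multi-head case (Eq. (7)), several such heads run in parallel on projected inputs and their outputs are concatenated and multiplied by \(W^O\).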

On top of the multi-head attention, there are two skip connections and two normalization layers interspersed with fully connected feed-forward networks. The residual connections help the encoder combine features from different layers, merging different levels of representation [7]. The normalization layers standardize the residual-connection and feed-forward outputs, giving the model numerical stability. The normalization is calculated as follows:

$$\begin{aligned} \bar{x}_i = \frac{x_i - \mu _B}{\sqrt{\sigma _B^2 + \epsilon } } \end{aligned}$$
(8)
$$\begin{aligned} z_i = \gamma \cdot \bar{x}_i + \beta \end{aligned}$$
(9)

where \(\mu _B\) and \(\sigma _B^2\) are respectively the batch mean and variance, \(\epsilon \) is a stability factor added to the variance to avoid division by zero, \(\gamma \) and \(\beta \) are learned parameters, and \(z_i\) is the normalized value of \(x_i\). Note that \(x_i\) is the element-wise sum of the \(Multi\_Head\) vector and the positional embeddings (skip connection), or of the normalization output and the feed-forward output, as shown in Fig. 1.
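Equations (8) and (9) can be sketched for a single vector as follows (with scalar \(\gamma \) and \(\beta \) for simplicity; in the model they are learned parameters):

```python
import math

def normalize(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Eqs. (8)-(9): center and scale x to zero mean / unit variance,
    then apply the learned scale (gamma) and shift (beta)."""
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [gamma * ((xi - mu) / math.sqrt(var + eps)) + beta for xi in x]

z = normalize([1.0, 2.0, 3.0, 4.0])
# z has (approximately) zero mean and unit variance.
```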

Lastly, the feed-forward network consists of two layers on top of each encoder block. Its goal is to project the normalized multi-head attention output into another dimensional space and to add non-linearity.

The encoder’s final output is the matrix \(Z=(K, V)\), where K are key vectors and V value vectors of the tokens in the input sentence. The decoder uses this matrix to focus on appropriate tokens in the input sentence to generate the predicted sequence.

4.3 Transformer Decoder

The Transformer Decoder is the second component of our language model. Its goal is to produce a grid cell token sequence from the input sentence encoded by the Transformer Encoder. For that, it uses the auto-regressive method, i.e., it predicts each token in the sequence based on the previous ones produced by the model.

Similar to the encoder, the decoder is also composed of Transformer blocks. The decoder self-attention works, however, in a slightly different way. While the self-attention in the encoder considers all tokens in the trajectory to generate the attention weights, the decoder only considers the tokens preceding the current one to predict the next. For that, the Transformer masks future positions using the look-ahead mask approach [16].
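The look-ahead mask itself is a simple upper-triangular matrix; a minimal sketch:

```python
def look_ahead_mask(size: int):
    """Look-ahead mask for masked decoder self-attention: entry (i, j) is 1
    when position j lies in the future of position i and must be blocked,
    so position i may only attend to positions <= i."""
    return [[1 if j > i else 0 for j in range(size)] for i in range(size)]

mask = look_ahead_mask(4)
# Row 0 blocks positions 1..3; the last row blocks nothing.
```

In practice the blocked positions receive a score of minus infinity before the softmax, so their attention weight becomes zero.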

In addition, the first decoder multi-head attention layer learns a query matrix \(Q_{dec}\) from the previously predicted tokens. First, the decoder receives the output from the previous layer (embeddings). Then, similar to the encoder, the decoder augments it with a positional embedding layer and feeds it to multi-head attention to generate \(Q_{dec}\). Next, the query vector and the residual connection feed a normalization layer, as in the encoder. A second multi-head attention layer then receives the learned \(Q_{dec}\) and the matrix \(Z=(K, V)\) from the encoder output to guide the query/search process. This second multi-head layer allows the decoder to focus on which trajectory points from the encoder are relevant to predict the next token/point. After the second multi-head layer generates the encoder-decoder attention vector, the decoder passes it to a feed-forward layer, followed by another normalization, to add non-linearity and stability to the values.

Finally, the last layer implements a feed-forward neural network that projects the decoder vectors in a large dimension (vocabulary size) to represent the logit vectorFootnote 3. Each logit represents a token/cell score, which the softmax function turns into a probability. The model outputs the highest probability token for each position in our decoding strategy, generating the predicted sequence \(\hat{T}\).

4.4 Training

We use the sparse categorical cross-entropy loss to train our model since the labels are integers. The loss is described as follows:

$$\begin{aligned} L(y,\hat{y}) = - \sum _{j=0}^{M}\sum _{i=0}^{N} (y_{ij}\cdot log(\hat{y}_{ij})) \end{aligned}$$
(10)

where \(y_{ij}\) is the target, and \(\hat{y}_{ij}\) represents the prediction. To train the model, we use the Adam optimizer (\(\beta _1=0.9\), \(\beta _2=0.9\), and \(\epsilon =1e^{-9}\)) with a flexible learning rate that increases at the beginning of training and decreases slowly during the remaining training steps, following [16]. We also apply residual dropout with a rate of 0.1 for each layer in the encoder and decoder.
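For integer targets, Eq. (10) reduces to summing the negative log-probability assigned to the true token at each step; a minimal sketch:

```python
import math

def sparse_categorical_cross_entropy(targets, probs):
    """Cross-entropy with integer token ids as targets: only the predicted
    probability of the true token contributes at each step (Eq. (10))."""
    return -sum(math.log(p[t]) for t, p in zip(targets, probs))

# Two-step sequence over a 3-token vocabulary.
targets = [2, 0]
probs = [[0.1, 0.1, 0.8],   # true token 2 gets probability 0.8
         [0.7, 0.2, 0.1]]   # true token 0 gets probability 0.7
loss = sparse_categorical_cross_entropy(targets, probs)
```

The loss is minimized when the model places all probability mass on the true tokens, which is exactly what teacher-forced training pushes the decoder towards.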

It is worth mentioning that our approach learns to generate the input trajectory, so the input and target sequences are the same during training. In addition, during training we use teacher forcing, i.e., we pass the true output to each successive step in the decoder. Finally, at inference time, we provide the input to the encoder and a start token to the decoder, which outputs predictions one token at a time.
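The autoregressive inference loop can be sketched as follows, with a toy stand-in for the trained decoder (toy_step is a hypothetical placeholder, not part of the model):

```python
def greedy_decode(step_fn, start_token, max_len):
    """Autoregressive inference: feed the previously predicted tokens back
    into the decoder and take the argmax token at each step. step_fn stands
    in for the trained decoder and returns a probability list over the
    vocabulary given the tokens produced so far."""
    tokens = [start_token]
    for _ in range(max_len):
        probs = step_fn(tokens)
        tokens.append(max(range(len(probs)), key=probs.__getitem__))
    return tokens[1:]  # drop the start token

# Toy "model" over a 4-token vocabulary that always favors (last + 1) mod 4.
def toy_step(tokens):
    probs = [0.1, 0.1, 0.1, 0.1]
    probs[(tokens[-1] + 1) % 4] = 0.7
    return probs

decoded = greedy_decode(toy_step, start_token=0, max_len=3)
```

During training, by contrast, teacher forcing replaces the model's own predictions with the true tokens at every step.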

4.5 Anomaly Detector

Given the encoder’s input token sequence and the decoder’s output, as aforementioned, our solution produces two outputs for a given bus trajectory: its anomaly score and the regions where the anomaly occurs in the trajectory. Our primary assumption is that our trained language model predicts the correct sequence. Any token in the input sequence (trajectory) that diverges from the predicted ones is considered an anomaly.

Thus, to calculate the trajectory’s anomaly score, the detector compares the sequence predicted by the decoder with the encoder’s input sequence by aligning them and computing their Hamming distance [14], as follows: score = 1 - (Hamming(T,\(\hat{T}\))/n), where \(\hat{T}\) is the decoder’s predicted sequence, T is the language model’s input sequence, and n is their length.

We consider as anomalous regions in the input sequence T the trajectory points represented by the unmatched tokens between T and \(\hat{T}\).
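Both outputs of the Anomaly Detector follow directly from this token comparison; a minimal sketch over illustrative token sequences:

```python
def anomaly_score(T, T_hat):
    """Score = 1 - (Hamming(T, T_hat) / n), as in the paper: values near 1.0
    mean the model reproduced the trajectory, lower values mean more
    mismatched (anomalous) tokens."""
    assert len(T) == len(T_hat)
    hamming = sum(1 for a, b in zip(T, T_hat) if a != b)
    return 1.0 - hamming / len(T)

def anomalous_regions(T, T_hat):
    """Indices of the unmatched tokens, i.e., the anomalous points in T."""
    return [i for i, (a, b) in enumerate(zip(T, T_hat)) if a != b]

T     = ["c1", "c2", "c9", "c9", "c5"]   # observed trajectory tokens
T_hat = ["c1", "c2", "c3", "c4", "c5"]   # sequence predicted by the LM
```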

5 Data Description and Setup

5.1 Experimental Setup

In this section, we provide details about the setup of our experimental evaluation.

Datasets. We conducted our experiments on two real-world bus trajectory datasets. The first dataset is from Recife, Brazil. It comprises 19,290 trajectories (100 points on average) from 82 bus lines generated by 238 buses from October 2017 to November 2017. Each bus reports points at intervals of 30 s, containing longitude, latitude, timestamp, route id, vehicle id, instantaneous velocity, and travel distance (from the beginning of the trip). The second dataset is from DublinFootnote 4, Ireland. It contains 60,084 trajectories (206 points on average) and 68 bus lines. Each trajectory point is reported at intervals of 20 to 50 s. In total, there are 12,497,472 points collected from Jan 01 2013 to Jan 04 2013. Each point contains the attributes: latitude, longitude, timestamp, line id, journey id, and vehicle id.

Pre-processing. As mentioned in Sect. 4.1, we map the trajectories into a geographical grid. Table 1 presents the statistics of trajectories before and after the grid-mapping transformation using the H3 parameter resolution 10 (16 is the maximum resolution), which we chose by experimentationFootnote 5. As can be seen, this transformation greatly reduces the dimensionality of both datasets.

Table 1. Number of unique points before and after the Grid Mapping.

Ground Truth. Since there are no labels available in our datasets, one can either manually label anomalies, as in [17, 18], or generate artificial ones, as in [10, 12, 22]. Because manual labeling is time-consuming, we chose to generate synthetic anomalies by adding perturbations to the real trajectories. We do so by randomly choosing a first point in an actual trajectory t and shifting it, along with the following n points in t, sequentially. In the experiments, we use two parameters to create different anomalous trajectories from real ones: d (the distance in kilometers from the real points) and p (the percentage of shifted points). For example, using \(d=0.5\) and \(p=0.1\), 10% of trajectory points are moved 500 m from the real points. We generate anomalous trajectories for our experiments considering the values \(p=[0.1, 0.2, 0.3]\) and \(d=1.0\).
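The perturbation procedure can be sketched as follows. This is a simplified version under stated assumptions: we shift latitude only and convert d to degrees with the ~111 km/degree approximation; the exact shifting scheme used by the authors may differ.

```python
import random

def perturb_trajectory(traj, p=0.1, d=1.0, seed=42):
    """Create a synthetic anomaly: shift a random run of p * len(traj)
    consecutive points by roughly d kilometers. Returns the perturbed
    trajectory and the indices of the shifted (ground-truth anomalous)
    points."""
    rng = random.Random(seed)
    traj = [list(pt) for pt in traj]
    n_shift = max(1, int(len(traj) * p))
    start = rng.randrange(0, len(traj) - n_shift + 1)
    offset = d / 111.0  # ~1 degree of latitude is about 111 km
    for i in range(start, start + n_shift):
        traj[i][0] += offset
    return [tuple(pt) for pt in traj], list(range(start, start + n_shift))

real = [(-8.05 + 0.001 * i, -34.88) for i in range(10)]
anomalous, shifted_idx = perturb_trajectory(real, p=0.2, d=1.0)
```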

Table 2. Values of hyper-parameters of Transformer.

Baselines. We evaluate the following anomaly detection methods in our experiments:

  • RioBusData [2] is a supervised method to detect anomalous bus trajectories by classifying them into bus routes. It uses a Convolutional Neural Network (CNN) fed with raw bus trajectories. On top of the CNN, a softmax function outputs a probability vector where each value is the probability of membership in a route/label class. A trajectory is abnormal if its highest class probability is below a given threshold.

  • STOD [6] is also a supervised method that detects anomalous bus trajectories and learns to classify bus trajectories in their routes using a deep-learning network. The model outputs the routes’ class distribution of a given trajectory. From this distribution, it calculates the uncertainty of the classifier using entropy as a measure of anomaly degree. The higher the classifier uncertainty, the higher the entropy. A trajectory is anomalous if the entropy of the classifier’s probability distribution output for it is higher than a threshold.

  • GM-VSAE [13] uses an encoder-decoder strategy to detect anomalous trajectories. To perform that, firstly, the encoder infers a disentangled latent space to discover the distribution of each trajectory based on this space. This distribution is then fed to the decoder that generates a trajectory. The method calculates a score comparing the generated trajectory with the input trajectory. A high score \(\approx \) 1.0 means that the input trajectory has a high probability of being an anomaly.

  • iBOAT [5] is based on the isolation mechanism [11] and an adaptive window approach. It performs both trajectory and sub-trajectory anomaly detection. iBOAT calculates the frequency of points mapped into a grid cell to isolate “few and different” points. Based on that, trajectories that visit cells with low frequency get small scores, meaning that those points are highly likely to be anomalies. Conversely, trajectories with highly frequent visited cells have high scores of non-anomaly. Similar to the previous methods, iBOAT also needs a threshold to detect anomalies.

  • Transformer is our proposed approachFootnote 6. We train our model on both datasets with the hyper-parameter values shown in Table 2.

It is worth pointing out that all those methods identify anomalous trajectories, but only our approach (Transformer) and iBOAT detect anomalous sub-trajectories. We randomly selected 8,200 trajectories from the Recife dataset and 6,800 from Dublin to evaluate the approaches. Based on a data analysis, we defined the maximum number of points as 100 for the Recife dataset and 208 for Dublin.

Evaluation Metrics. We use F1-measure, Precision, Recall, and PR-AUC as evaluation metrics since they are usually applied to evaluate outlier detection methods [1, 13]. To verify whether the F1-measure values of our model are statistically different from the baselines, we execute the Wilcoxon statistical test [15]. The test verifies whether two paired samples (F1-measure values of our solution vs. a baseline) come from the same distribution. Given that, we set the significance level \(\alpha =5\%\). In our context, the null hypothesis \(h_0\) considers that the median difference between the F1 values of a pair of models is zero. We performed this statistical test on the instances in the test set.
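For reference, the per-trajectory metrics can be computed from binary anomaly labels as follows (a standard sketch, not the authors' evaluation code):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary anomaly labels (1 = anomalous)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
prec, rec, f1 = precision_recall_f1(y_true, y_pred)
```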

6 Results and Discussion

In this section, we first present the evaluation of the trajectory outlier identification and, subsequently, the region anomaly detection.

Table 3. Results of anomaly trajectory detection on the Dublin dataset.
Table 4. Results of anomaly trajectory detection on the Recife dataset.

6.1 Trajectory Anomaly Detection

Table 3 presents the results for the trajectory anomaly detection task on the Dublin dataset. Transformer outperforms the baselines in all scenarios and metrics. For example, considering the best results on F1, our approach is at least \(17\%\) better than all baselines. To confirm this, Table 6 depicts the p-values for the results of each baseline in comparison to Transformer on the Dublin dataset. All the p-values are smaller than the significance level (0.05), which supports that our Transformer network does, in fact, perform better than the baselines on this dataset.

Regarding the results on the Recife dataset, presented in Table 4, our method obtains better F1-measure values than STOD and RioBusData, and comparable ones to GMVSAE and iBOAT. The p-values of the hypothesis tests on this dataset for the F1-measure, in Table 7, confirm this: the p-values of GMVSAE and iBOAT versus Transformer in all scenarios are higher than the significance level of 0.05, meaning there is no statistical difference in terms of F1-measure between our approach and them. Regarding RioBusData and STOD, however, the p-values are lower than 0.05 in two of the three anomaly cases. Looking at precision values, Transformer achieves the best overall results but lower recall than GMVSAE and iBOAT. In practice, better precision can be an advantage since an anomaly detection model can be considered a filter that identifies the possibly few anomalous trajectories in a large set of trajectories. The more precise this filter is, the fewer false positives need to be inspected.

Overall, the methods built to detect anomalies directly (Transformer, GMVSAE, and iBOAT) outperformed, in almost all scenarios, the ones that do so indirectly (STOD and RioBusData), i.e., by learning to classify routes instead of anomalies. For example, for \(p=0.1\), RioBusData obtained the lowest F1 on both Dublin (0.651) and Recife (0.533). However, on the Dublin dataset, STOD shows better F1 results than iBOAT for \(p=0.2\) (0.69 vs 0.673, respectively) and, for \(p=0.3\), obtained the second-best F1 value (0.724), behind only Transformer (0.854). We also observed that STOD is more sensitive than our method to the outlier percentage p. For example, for \(p=0.1\) and \(p=0.3\), its F1-measure on Recife is 0.58 and 0.66, respectively (a difference of 0.08). In contrast, our method is more stable regarding the outlier level. For instance, for \(p=0.1\) and \(p=0.3\), the F1-measure is respectively 0.84 and 0.85 on Dublin and 0.66 and 0.67 on Recife, i.e., there is not much difference.

Fig. 2.
figure 2

PR-AUC of the approaches for route 1 on the Dublin dataset and route 54 on the Recife dataset.

Table 5. Results for the region anomaly detection models.

To provide a detailed analysis of the approaches on individual routes, Fig. 2 shows the PR-AUC curves of all methods on two different routes, one from each dataset, with \(p=0.3\). On route 54 from Recife, our approach has the highest area under the curve (\(\approx 0.98\)), outperforming both the unsupervised methods (GMVSAE \(\approx 0.94\) and iBOAT \(\approx 0.77\)) and the supervised ones (STOD \(\approx 0.54\) and RioBusData \(\approx 0.67\)). Looking at route 1 from Dublin, we observe that the encoder-decoder methods have almost perfect curves (AUC \(\approx 0.99\)), i.e., these models can adequately distinguish anomalous trajectories from non-anomalous ones. Conversely, RioBusData has the worst PR-AUC curve (\(\approx 0.70\)). Finally, we can see that the precision of iBOAT degrades as recall approaches 1.

Table 6. Hypothesis test for F1 on the Dublin dataset.
Table 7. Hypothesis test for F1 on the Recife dataset.

6.2 Region Anomaly Detection

Table 5 shows the results of Transformer and iBOAT on region anomaly detection. We observe that Transformer outperforms iBOAT on the Recife dataset in all scenarios. This occurs mainly because our model achieves high precision values (0.988, 0.975, 0.961). The methods are, however, similar regarding recall: for \(p=0.1\), for instance, Transformer’s recall is 0.992 and iBOAT’s is 0.990. On the Dublin dataset, the methods are qualitatively similar. Note that the largest difference between the models occurs for \(p=0.1\): Transformer’s F1 is 0.986, and iBOAT’s F1 is 0.978.

We applied the Wilcoxon test to verify whether there is a statistical difference between the F1-measure values of the methods on this task. On the Recife dataset, the models are statistically different, with p-values lower than our significance level of 0.05 in all scenarios. On the Dublin dataset, however, there is no statistical evidence to reject \(h_0\), since the p-values are higher than 0.05; therefore, both models are statistically equivalent in terms of F1-measure.

To present concrete examples of the detection of region anomalies on real trajectories by our model, Fig. 3a shows the expected trajectories of Dublin route 1, and Fig. 3b depicts the anomalous regions identified (represented by the red dots) by Transformer. One can see from these plots that our approach identifies the anomalous regions in this trajectory very precisely.

Fig. 3.
figure 3

Example of anomaly detection inference.

7 Conclusion

In this paper, we propose a solution that applies an encoder-decoder Transformer language model to bus trajectory data to solve two problems: trajectory and sub-trajectory anomaly detection. Our solution transforms a trajectory into a discrete token sequence by mapping its points to tokens representing geographical grid cells. This sequence is then passed to the Transformer language model, which outputs a predicted sequence, supposedly without anomalies. Finally, our solution calculates the trajectory’s anomaly score by applying the Hamming distance between the two sequences and identifies the anomalous regions by looking at the unmatched tokens between them. Experiments on two real-world bus trajectory datasets demonstrate that our approach is effective for both the anomalous trajectory detection and anomalous region detection tasks.

In future work, we intend to train our approach on multiple trajectory datasets to verify whether it can learn general trajectory patterns (a deep representation). Once our approach learns those patterns, we want to explore other tasks, such as trajectory similarity and classification, using transfer learning.