1 Introduction

With the advances in GPS devices and smart city infrastructure, more and more mobility data is being generated [23]. Vehicles, smartwatches, traffic signals, and mobile phones, to name only a few, generate a massive volume of trajectory data. These data offer an unprecedented opportunity to discover rich information about traffic behavior, people’s mobility, and weather impact [3, 21].

Anomaly detection is the task of finding observations that stand out as dissimilar to all others [4]. Particularly in the traffic context, which is easily influenced by external factors (e.g., accidents, detours, events, and weather conditions), trajectory anomaly detection is crucial to understand traffic behavior and to support better decision-making by transit authorities.

Although trajectory anomaly detection has been a research hotspot [1], some challenges remain. For example, most solutions rely on handcrafted features and physical trajectory characteristics such as density [8], distance [9], and isolation [5]. Such features can be costly when dealing, for instance, with a high volume of trajectories, or might not work well in the case of data sparseness [20], which occurs when the period between a trajectory’s consecutive reported points is too large. In addition, little attention has been paid to finding anomalous regions in an online way [5], which can benefit real applications, as the following examples show:

Example 1:

A car ridesharing company might store thousands of daily trajectories to suggest the best routes to its riders and drivers. Therefore, finding anomalous trajectories is an essential feature, since it can help recommend alternative paths, free of anomalies such as traffic congestion, and alert riders or penalize drivers who do not follow the paths recommended by its app.

Example 2:

Public transportation agencies in large cities worldwide face traffic problems such as congestion, accidents, and poor weather conditions. At the same time, these agencies need to serve the population by making real-time decisions to deal with these issues, such as proposing alternative routes or releasing more buses. Given that, an online trajectory anomaly detection approach can identify problematic bus trips, allowing transit authorities to intervene as quickly as possible.

In this paper, we propose an approach that uses language modeling, commonly used in Natural Language Processing tasks, to: (1) detect anomalous trajectories in bus trajectories and (2) pinpoint the abnormal points in these trajectories (sub-trajectory anomaly detection). Our solution allows these tasks to be performed either offline or online, i.e., as the buses move along their route. In addition, it can also easily be adapted to other types of trajectories (e.g., cars, people, and vessels) since our model does not use any specific aspect from the bus domain (e.g., bus stops).

The key idea of our model is to learn the language of well-formed trajectories and then identify erroneous (ill-formed) trajectories and the trajectories’ points where the errors occur. For that, given an input trajectory T mapped by our solution to a sequence of tokens, our language model (LM) generates \(\hat{T}\), the most likely (common) sequence of points for T, which represents T supposedly without anomalies. Our assumption is, therefore, that since anomalous points are typically few and different from the others [5], there is a small chance that abnormal points are present in \(\hat{T}\). Based on that, our solution produces an anomaly score for T and pinpoints anomalous regions in T by comparing T with \(\hat{T}\).

We build our language model using a deep generative encoder-decoder Transformer [16] to learn the relationships between the sequential points in trajectories. More specifically, our solution first maps the input raw trajectory’s points into a geographical grid and uses each grid cell’s id as a token to represent the trajectory points. Then, this token-based trajectory representation feeds the Transformer encoder, which applies the self-attention mechanism to relate tokens in the sequence. The decoder receives the encoder’s output and leverages previously predicted tokens to generate the next one based on the self-attention mechanism.

We have conducted an extensive experimental evaluation on real-world bus datasets from Recife (Brazil) and Dublin (Ireland) cities. The results show that our language model effectively detects whether trajectories are anomalous and, at the same time, finds anomalous regions without any handcrafted features.

The rest of this paper is organized as follows. Section 2 reviews the state of the art in GPS trajectory anomaly detection. In Sect. 3, we present some background concepts and the problem statement. Section 4 delineates our model, and Sect. 5 describes the datasets and the experimental setup. We compare our results with state-of-the-art algorithms and previous research in Sect. 6. Finally, conclusions and future work are drawn in Sect. 7.

2 Related Work

A recent survey [1] provides a comprehensive summary of the state-of-the-art solutions in trajectory anomaly detection. Most of them are based on distance [9] and pattern mining [5]. In addition, some approaches use machine learning in a supervised [6] or semi-supervised [13] way. In this section, we discuss some of these approaches in detail.

iBOAT (Isolation-based Online Anomalous Trajectory Detection) [5] detects anomalies point-by-point by isolating trajectories that are “few” and “different” from historical trajectories on the same route. It uses a “support” function to count how many historical trajectories share points with an ongoing trajectory (called a window). If the support value falls below a certain threshold, the points in the window are marked as anomalous. An anomaly score is then assigned using a logistic function, giving high values to anomalous points and low values to non-anomalous ones. iBOAT operates online, generating anomaly scores and detecting abnormal points in real-time, but it requires trajectories with the same departure and destination points to function.

In GMVSAE [13], the authors propose a deep learning encoder-decoder approach to detect whether a trajectory is anomalous. The approach uses an LSTM (Long Short-Term Memory) architecture to encode and decode trajectories. The encoder models trajectories with a Gaussian mixture distribution. Then, the proposed method uses each distribution as the decoder’s input to calculate the probability of an ongoing trajectory belonging to one of the distributions. The trajectory is anomalous if the probability is below a certain threshold.

STOD (Spatial-Temporal Outlier Detector) [6] uses a supervised deep-learning model to detect trajectory anomalies in bus routes. It classifies bus trajectories based on predefined routes and considers them anomalous if the model’s confidence in the classification is low. Each trajectory point is represented with features like GPS timestamp and a pre-trained embedding vector, which are fed into a Bi-GRU (Gated Recurrent Unit) architecture followed by a Multi-Layer Perceptron (MLP) with a softmax function. STOD calculates entropy over the softmax output’s probability distribution to determine prediction confidence. A trajectory is marked as anomalous if the confidence score falls below a threshold. Similar to STOD, [2] proposes a multi-class Convolutional Neural Network to classify trajectories by route IDs and to identify anomalies based on misclassification or low-class probability. Both approaches require route ID labels, do not perform online anomaly detection, and do not detect anomalous regions.

Lastly, CTSS (Continuous Trajectory Similarity Search) [19] presents an online trajectory search method based on similarity scores to detect anomalous trajectories. For this, the authors proposed an approach that considers the current point of an input trajectory and all possible paths to arrive at the trajectory destination. Given all those future trajectories and a reference trajectory (ground truth), the approach calculates similarity scores between each pair (future trajectory and the reference one) and returns the one with minimum distance. Then, if the score is greater than a threshold, the input trajectory is anomalous. CTSS needs ground truth trajectories to calculate the similarity scores and knowledge of the trajectory destination in advance.

3 Problem Formulation

In this section, we provide some background concepts and state the problem we deal with in this work.

Definition 1

[Bus Trajectory]. We define a bus trajectory T as a sequence of consecutive GPS points collected from a bus trip, denoted as \(T=\{p_1,p_2,...,p_n\}\), where n is the length of the trajectory. Each point \(p_i=\{lat_i,lng_i\}\in \mathbb {R}^{2}\) is composed of latitude (\(lat_i\)) and longitude (\(lng_i\)) and is associated with a bus trip, and the points are ordered by their timestamps, i.e., \(tsp_i < tsp_{i+1}\).

Definition 2

[Spatial Anomaly Trajectory]. We consider a trajectory T as anomalous if some of its points spatially diverge from regular trajectories of T’s assigned bus route.

Definition 3

[Problem Statement]. Given a bus trajectory T, we aim to calculate the spatial anomaly score of T and detect the anomalous points of T.

4 Method

In this section, we introduce our approach to discovering abnormal trajectories and localizing anomaly regions in trajectories. As Fig. 1 shows, our solution is composed of three main components: Grid Mapping, Transformer Language Model, and Anomaly Detector. They work as follows. Given an input bus trajectory, the Grid Mapping discretizes it by mapping each of its points to a geographical grid cell, represented by a token, generating a sequence of grid cell tokens. This sequence is passed to a language model, a deep generative encoder-decoder Transformer, that produces a series of predicted grid cell tokens. Finally, the Anomaly Detector compares the original token sequence with the predicted one to calculate the trajectory’s anomaly score and identify anomalous points if they exist. In the remainder of this section, we provide further details about each one of these components.

Fig. 1.
figure 1

The trajectory anomaly detection solution proposed in this work.

4.1 Grid Mapping

The first step of our approach, Grid Mapping, maps the trajectories, which are multivariate time series, into a univariate sequence of tokens. For this purpose, we use the H3 (Hexagonal Hierarchical Geospatial Indexing System)Footnote 1 library to create a grid system based on hexagonal cells and a hierarchical index. More specifically, given a raw trajectory \(T=(p_1, p_2,..,p_n)\), where each trajectory point \(p_i\) is represented by its latitude and longitude (lat, lng), and a grid of cells G, we use the geoToH3 function in the H3 library to map the trajectory points into grid cell locations \(c_i\), generating \(T'=(c_1, c_2,...,c_n)\). As a result, every point that falls into the same cell has the same identifier (token). Therefore, this mapping reduces the complexity of dealing with a continuous multi-dimensional domain to a discrete uni-dimensional one, which language models can adequately process.
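As a dependency-free illustration of this discretization step (the actual solution uses H3’s hexagonal cells via geoToH3), the sketch below maps points onto a plain square grid with a hypothetical cell size; the essential property is the same: nearby points collapse to the same token.

```python
# Simplified sketch of the Grid Mapping step using a square grid instead of
# H3 hexagons. The cell size (in degrees) is an illustrative assumption.

def to_cell_token(lat: float, lng: float, cell_deg: float = 0.01) -> str:
    """Map a GPS point to a discrete grid-cell token."""
    row = int(lat // cell_deg)
    col = int(lng // cell_deg)
    return f"cell_{row}_{col}"

def trajectory_to_tokens(points, cell_deg: float = 0.01):
    """Discretize a raw trajectory [(lat, lng), ...] into a token sequence."""
    return [to_cell_token(lat, lng, cell_deg) for lat, lng in points]

# Two nearby points fall into the same cell and thus share a token.
traj = [(-8.0476, -34.8770), (-8.0477, -34.8771), (-8.0600, -34.8900)]
tokens = trajectory_to_tokens(traj)
```

With H3 the call would instead return a hexagonal cell index, but the resulting token sequence plays the same role as input to the language model.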

4.2 Transformer Encoder

The first component of our language model is the Transformer Encoder which maps the sequence of trajectory tokens \(T'\) into a set of vectors that feeds the Transformer Decoder. For that, it encodes the tokens in \(T'\) based on the other tokens (points) in \(T'\) by applying the self-attention strategy.

More concretely, as Fig. 1 depicts, the encoder first generates a sequence of embeddings from the trajectory tokens. An embedding is a vector representation of a token in an n-dimension spaceFootnote 2. Next, a positional encoding adds position information to the embeddings. Similar to [16], our position encoder is calculated as:

$$\begin{aligned} PE(pos,2i) = sin(pos/10000^{2i/d_{model}}) \end{aligned}$$
(1)
$$\begin{aligned} PE(pos,2i+1) = cos(pos/10000^{2i/d_{model}}) \end{aligned}$$
(2)

where sin and cos are the sine and cosine functions, respectively, pos is the position of the point in the trajectory, i indexes the embedding dimensions, and \(d_{model}\) is the dimension of the embedding vector. To create the final embedding representation for each input token, the model performs an element-wise addition of the token embedding and the positional encoding vector.
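Equations (1) and (2) can be computed directly; the sketch below builds the sinusoidal encoding for a short sequence in pure Python, assuming the standard base of 10000 from [16]:

```python
import math

def positional_encoding(seq_len: int, d_model: int):
    """Sinusoidal positional encoding as in Eqs. (1)-(2): even dimensions
    use sine, odd dimensions use cosine, with frequencies decaying in i."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# pe[0] alternates sin(0) = 0 and cos(0) = 1 across dimensions.
```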

The model then passes these embeddings to the Transformer block with four identical encoder layers. It uses the so-called multi-head self-attention to allow the network to attend to different input sequence positions and learn which points in the sequence are relevant to the current one. Multiple heads create multiple representation subspaces by learning weight matrices for queries Q and keys K of dimension \(d_k\), and for values V of dimension \(d_v\). Each head computes the attention weights for a given token embedding j (\(Embedding_j\)) as follows:

$$\begin{aligned} Q_{i,j} = W^{Q}_i \cdot Embedding_j \end{aligned}$$
(3)
$$\begin{aligned} K_{i,j} = W^{K}_i \cdot Embedding_j \end{aligned}$$
(4)
$$\begin{aligned} V_{i,j} = W^{V}_i \cdot Embedding_j \end{aligned}$$
(5)
$$\begin{aligned} Head_i = softmax(\dfrac{Q_iK_i^\top }{\sqrt{d_k}})V_i \end{aligned}$$
(6)

with parameter matrices \(W^Q \in \mathbb {R}^{d_{model}\times d_k}\), \(W^K \in \mathbb {R}^{d_{model}\times d_k}\), and \(W^V \in \mathbb {R}^{d_{model}\times d_v}\).

The multi-head attention combines the individual heads as follows:

$$\begin{aligned} Multi\_Head=concat(Head_1, Head_2,...,Head_n)W^O \end{aligned}$$
(7)

where \(W^O \in \mathbb {R}^{hd_{v}\times d_{model}}\) is a weight matrix learned during training.
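A minimal sketch of a single attention head (Eqs. (3)-(6)), with Q, K, and V given directly rather than produced by the learned projection matrices \(W^Q\), \(W^K\), \(W^V\):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def scaled_dot_product_attention(Q, K, V):
    """One attention head, Eq. (6): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    Kt = [list(col) for col in zip(*K)]                       # K transposed
    scores = matmul(Q, Kt)                                    # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row])     # row-wise softmax
               for row in scores]
    return matmul(weights, V)                                 # weighted values

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = scaled_dot_product_attention(Q, K, V)
# Each output row is a convex combination of the value rows.
```

In the multi-head case (Eq. (7)), several such heads run in parallel on projected inputs and their outputs are concatenated and multiplied by \(W^O\).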

On top of the multi-head attention, there are two skip connections and two normalization layers interspersed with fully connected feed-forward networks. The residual connections help the encoder combine features from different layers, merging different levels of representation [7]. The normalization layers standardize the residual-connection and feed-forward outputs, giving the model numerical stability. The normalization is calculated as follows:

$$\begin{aligned} \bar{x}_i = \frac{x_i - \mu _B}{\sqrt{\sigma _B^2 + \epsilon } } \end{aligned}$$
(8)
$$\begin{aligned} z_i = \gamma \cdot \bar{x}_i + \beta \end{aligned}$$
(9)

where \(\mu _B\) and \(\sigma _B^2\) are respectively the batch mean and variance, \(\epsilon \) is a stability factor added to the variance to avoid division by zero, \(\gamma \) and \(\beta \) are learned parameters, and \(z_i\) is the normalized value of \(x_i\). Note that \(x_i\) is the element-wise sum of the \(Multi\_Head\) vector and the positional embeddings (skip connection), or of the normalization output and the feed-forward output, as shown in Fig. 1.
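Equations (8) and (9) can be sketched for a single vector as follows (with scalar \(\gamma \) and \(\beta \) for simplicity; in the model they are learned parameters):

```python
import math

def normalize(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Eqs. (8)-(9): center and scale x to zero mean / unit variance,
    then apply the learned scale (gamma) and shift (beta)."""
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [gamma * ((xi - mu) / math.sqrt(var + eps)) + beta for xi in x]

z = normalize([1.0, 2.0, 3.0, 4.0])
# z has (approximately) zero mean and unit variance.
```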

Lastly, the feed-forward network consists of two layers on top of each encoder block. Its goal is to project the normalized multi-head attention output into another dimensional space and to add non-linearity.

The encoder’s final output is the matrix \(Z=(K, V)\), where K are key vectors and V value vectors of the tokens in the input sentence. The decoder uses this matrix to focus on appropriate tokens in the input sentence to generate the predicted sequence.

4.3 Transformer Decoder

The Transformer Decoder is the second component of our language model. Its goal is to produce a grid cell token sequence from the input sentence encoded by the Transformer Encoder. For that, it uses the auto-regressive method, i.e., it predicts each token in the sequence based on the previous ones produced by the model.

Similar to the encoder, the decoder is also composed of Transformer blocks. The decoder self-attention works, however, in a slightly different way. While the self-attention in the encoder considers all tokens in the trajectory to generate the attention weights, the decoder only considers the tokens preceding the current one to predict the next. For that, the Transformer masks future positions using the look-ahead mask approach [16].
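The look-ahead mask itself is a simple upper-triangular matrix; a minimal sketch:

```python
def look_ahead_mask(size: int):
    """Look-ahead mask for masked decoder self-attention: entry (i, j) is 1
    when position j lies in the future of position i and must be blocked,
    so position i may only attend to positions <= i."""
    return [[1 if j > i else 0 for j in range(size)] for i in range(size)]

mask = look_ahead_mask(4)
# Row 0 blocks positions 1..3; the last row blocks nothing.
```

In practice the blocked positions receive a score of minus infinity before the softmax, so their attention weight becomes zero.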

In addition, the first decoder multi-head attention layer learns a query matrix \(Q_{dec}\) from the previously predicted tokens. First, the decoder receives the output from the previous layer (embeddings). Then, similar to the encoder, the decoder augments it with a positional embedding layer and feeds it to multi-head attention to generate \(Q_{dec}\). Next, the query vector and the residual connection feed a normalization layer, as in the encoder. A second multi-head attention layer then receives the learned \(Q_{dec}\) and the matrix \(Z=(K, V)\) from the encoder output to guide the query/search process. This second multi-head layer allows the decoder to focus on which trajectory points from the encoder are relevant to predict the next token/point. After the second multi-head layer generates the encoder-decoder attention vector, the decoder passes it to a feed-forward layer, followed by another normalization, to add non-linearity and stability to the values.

Finally, the last layer implements a feed-forward neural network that projects the decoder vectors in a large dimension (vocabulary size) to represent the logit vectorFootnote 3. Each logit represents a token/cell score, which the softmax function turns into a probability. The model outputs the highest probability token for each position in our decoding strategy, generating the predicted sequence \(\hat{T}\).

4.4 Training

We use the sparse categorical cross-entropy loss to train our model since the labels are integers. The loss is described as follows:

$$\begin{aligned} L(y,\hat{y}) = - \sum _{j=0}^{M}\sum _{i=0}^{N} (y_{ij}\cdot log(\hat{y}_{ij})) \end{aligned}$$
(10)

where \(y_{ij}\) is the target, and \(\hat{y}_{ij}\) represents the prediction. To train the model, we use the Adam optimizer (\(\beta _1=0.9\), \(\beta _2=0.9\), and \(\epsilon =1e^{-9}\)) with a flexible learning rate that increases at the beginning of training and decreases slowly during the remaining training steps, following [16]. We also apply residual dropout with a rate of 0.1 for each layer in the encoder and decoder.
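For integer targets, Eq. (10) reduces to summing the negative log-probability assigned to the true token at each step; a minimal sketch:

```python
import math

def sparse_categorical_cross_entropy(targets, probs):
    """Cross-entropy with integer token ids as targets: only the predicted
    probability of the true token contributes at each step (Eq. (10))."""
    return -sum(math.log(p[t]) for t, p in zip(targets, probs))

# Two-step sequence over a 3-token vocabulary.
targets = [2, 0]
probs = [[0.1, 0.1, 0.8],   # true token 2 gets probability 0.8
         [0.7, 0.2, 0.1]]   # true token 0 gets probability 0.7
loss = sparse_categorical_cross_entropy(targets, probs)
```

The loss is minimized when the model places all probability mass on the true tokens, which is exactly what teacher-forced training pushes the decoder towards.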

It is worth mentioning that our approach learns to generate the input trajectory, so the input and target sequences are the same during training. In addition, during training we use teacher forcing, i.e., we pass the true output to each successive step in the decoder. Finally, at inference time, we provide the input to the encoder and a start token to the decoder, which outputs predictions one token at a time.
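The autoregressive inference loop can be sketched as follows, with a toy stand-in for the trained decoder (toy_step is a hypothetical placeholder, not part of the model):

```python
def greedy_decode(step_fn, start_token, max_len):
    """Autoregressive inference: feed the previously predicted tokens back
    into the decoder and take the argmax token at each step. step_fn stands
    in for the trained decoder and returns a probability list over the
    vocabulary given the tokens produced so far."""
    tokens = [start_token]
    for _ in range(max_len):
        probs = step_fn(tokens)
        tokens.append(max(range(len(probs)), key=probs.__getitem__))
    return tokens[1:]  # drop the start token

# Toy "model" over a 4-token vocabulary that always favors (last + 1) mod 4.
def toy_step(tokens):
    probs = [0.1, 0.1, 0.1, 0.1]
    probs[(tokens[-1] + 1) % 4] = 0.7
    return probs

decoded = greedy_decode(toy_step, start_token=0, max_len=3)
```

During training, by contrast, teacher forcing replaces the model's own predictions with the true tokens at every step.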

4.5 Anomaly Detector

Given the encoder’s input token sequence and the decoder’s output, as aforementioned, our solution produces two outputs for a given bus trajectory: its anomaly score and the regions where the anomaly occurs in the trajectory. Our primary assumption is that our trained language model predicts the correct sequence. Any token in the input sequence (trajectory) that diverges from the predicted ones is considered an anomaly.

Thus, to calculate the trajectory’s anomaly score, the detector compares the sequence predicted by the decoder with the encoder’s input sequence by aligning them and computing their Hamming distance [14], as follows: score = 1 - (Hamming(T,\(\hat{T}\))/n), where \(\hat{T}\) is the decoder’s predicted sequence, T is the language model’s input sequence, and n is their length.

We consider as anomalous regions in the input sequence T the trajectory points represented by the unmatched tokens between T and \(\hat{T}\).
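Both outputs of the Anomaly Detector follow directly from this token comparison; a minimal sketch over illustrative token sequences:

```python
def anomaly_score(T, T_hat):
    """Score = 1 - (Hamming(T, T_hat) / n), as in the paper: values near 1.0
    mean the model reproduced the trajectory, lower values mean more
    mismatched (anomalous) tokens."""
    assert len(T) == len(T_hat)
    hamming = sum(1 for a, b in zip(T, T_hat) if a != b)
    return 1.0 - hamming / len(T)

def anomalous_regions(T, T_hat):
    """Indices of the unmatched tokens, i.e., the anomalous points in T."""
    return [i for i, (a, b) in enumerate(zip(T, T_hat)) if a != b]

T     = ["c1", "c2", "c9", "c9", "c5"]   # observed trajectory tokens
T_hat = ["c1", "c2", "c3", "c4", "c5"]   # sequence predicted by the LM
```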

5 Data Description and Setup

5.1 Experimental Setup

In this section, we provide details about the setup of our experimental evaluation.

Datasets. We conducted our experiments on two real-world bus trajectory datasets. The first dataset is from Recife, Brazil. It comprises 19,290 trajectories (100 points on average) from 82 bus lines generated by 238 buses from October 2017 to November 2017. Each bus reports points at intervals of 30 s, containing longitude, latitude, timestamp, route id, vehicle id, instantaneous velocity, and travel distance (from the beginning of the trip). The second dataset is from DublinFootnote 4, Ireland. It contains 60,084 trajectories (206 points on average) and 68 bus lines. Each trajectory point is reported at intervals of 20 to 50 s. In total, there are 12,497,472 points collected from Jan 01 2013 to Jan 04 2013. Each point contains the attributes: latitude, longitude, timestamp, line id, journey id, and vehicle id.

Pre-processing. As mentioned in Sect. 4.1, we map the trajectories into a geographical grid. Table 1 presents the statistics of trajectories before and after the grid-mapping transformation using the H3 parameter resolution 10 (16 is the maximum resolution), which we chose by experimentationFootnote 5. As can be seen, this transformation greatly reduces the dimensionality of both datasets.

Table 1. Number of unique points before and after the Grid Mapping.

Ground Truth. Since there are no labels available in our datasets, one can either manually label anomalies, as in [17, 18], or generate artificial ones, as in [10, 12, 22]. Because manual labeling is time-consuming, we chose to generate synthetic anomalies by adding perturbations to the real trajectories. We do so by randomly choosing a first point in an actual trajectory t and shifting it, along with the following n points in t, sequentially. In the experiments, we use two parameters to create different anomalous trajectories from real ones: d (the distance in kilometers from the real points) and p (the percentage of shifted points). For example, using \(d=0.5\) and \(p=0.1\), 10% of trajectory points are moved 500 m from the real points. We generate anomalous trajectories for our experiments considering the values \(p=[0.1, 0.2, 0.3]\) and \(d=1.0\).
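The perturbation procedure can be sketched as follows. This is a simplified version under stated assumptions: we shift latitude only and convert d to degrees with the ~111 km/degree approximation; the exact shifting scheme used by the authors may differ.

```python
import random

def perturb_trajectory(traj, p=0.1, d=1.0, seed=42):
    """Create a synthetic anomaly: shift a random run of p * len(traj)
    consecutive points by roughly d kilometers. Returns the perturbed
    trajectory and the indices of the shifted (ground-truth anomalous)
    points."""
    rng = random.Random(seed)
    traj = [list(pt) for pt in traj]
    n_shift = max(1, int(len(traj) * p))
    start = rng.randrange(0, len(traj) - n_shift + 1)
    offset = d / 111.0  # ~1 degree of latitude is about 111 km
    for i in range(start, start + n_shift):
        traj[i][0] += offset
    return [tuple(pt) for pt in traj], list(range(start, start + n_shift))

real = [(-8.05 + 0.001 * i, -34.88) for i in range(10)]
anomalous, shifted_idx = perturb_trajectory(real, p=0.2, d=1.0)
```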

Table 2. Values of hyper-parameters of Transformer.

Baselines. We evaluate the following anomaly detection methods in our experiments:

  • RioBusData [2] is a supervised method to detect anomalous bus trajectories by classifying them into bus routes. It uses a Convolutional Neural Network (CNN) fed with raw bus trajectories. On top of the CNN, a softmax function outputs a probability vector where each value is the probability of membership in a route/label class. A trajectory is abnormal if its highest class probability is below a given threshold.

  • STOD [6] is also a supervised method that detects anomalous bus trajectories and learns to classify bus trajectories in their routes using a deep-learning network. The model outputs the routes’ class distribution of a given trajectory. From this distribution, it calculates the uncertainty of the classifier using entropy as a measure of anomaly degree. The higher the classifier uncertainty, the higher the entropy. A trajectory is anomalous if the entropy of the classifier’s probability distribution output for it is higher than a threshold.

  • GM-VSAE [13] uses an encoder-decoder strategy to detect anomalous trajectories. To perform that, firstly, the encoder infers a disentangled latent space to discover the distribution of each trajectory based on this space. This distribution is then fed to the decoder that generates a trajectory. The method calculates a score comparing the generated trajectory with the input trajectory. A high score \(\approx \) 1.0 means that the input trajectory has a high probability of being an anomaly.

  • iBOAT [5] is based on the isolation mechanism [11] and an adaptive window approach. It performs both trajectory and sub-trajectory anomaly detection. iBOAT calculates the frequency of points mapped into a grid cell to isolate “few and different” points. Based on that, trajectories that visit cells with low frequency get small scores, meaning that those points are highly likely to be anomalies. Conversely, trajectories with highly frequent visited cells have high scores of non-anomaly. Similar to the previous methods, iBOAT also needs a threshold to detect anomalies.

  • Transformer is our proposed approachFootnote 6. We train our model on both datasets with the hyper-parameter values shown in Table 2.

It is worth pointing out that all those methods identify anomalous trajectories, but only our approach (Transformer) and iBOAT detect anomalous sub-trajectories. We randomly selected 8,200 trajectories from the Recife dataset and 6,800 from Dublin to evaluate the approaches. Based on a data analysis, we defined the maximum number of points as 100 for the Recife dataset and 208 for Dublin.

Evaluation Metrics. We use F1-measure, Precision, Recall, and PR-AUC as evaluation metrics since they are usually applied to evaluate outlier detection methods [1, 13]. To verify whether the F1-measure values of our model are statistically different from the baselines, we execute the Wilcoxon statistical test [15]. The test verifies whether two paired samples (F1-measure values of our solution vs. a baseline) come from the same distribution. Given that, we set the significance level \(\alpha =5\%\). In our context, the null hypothesis \(h_0\) considers that the median difference between the F1 values of a pair of models is zero. We performed this statistical test on the instances in the test set.
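For reference, the per-trajectory metrics can be computed from binary anomaly labels as follows (a standard sketch, not the authors' evaluation code):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary anomaly labels (1 = anomalous)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
prec, rec, f1 = precision_recall_f1(y_true, y_pred)
```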

6 Results and Discussion

In this section, we first present the evaluation of the trajectory outlier identification and, subsequently, the region anomaly detection.

Table 3. Results of anomaly trajectory detection on the Dublin dataset.
Table 4. Results of anomaly trajectory detection on the Recife dataset.

6.1 Trajectory Anomaly Detection

Table 3 presents the results for the trajectory anomaly detection task on the Dublin dataset. Transformer outperforms the baselines in all scenarios and metrics. For example, considering the best results on F1, our approach is at least \(17\%\) better than all baselines. To confirm this, Table 6 depicts the p-values for the results of each baseline in comparison to Transformer on the Dublin dataset. All the p-values are smaller than the significance level (0.05), which supports that our Transformer network does, in fact, perform better than the baselines on this dataset.

Regarding the results on the Recife dataset, presented in Table 4, our method obtains better F1-measure values than STOD and RioBusData, and comparable ones to GMVSAE and iBOAT. The p-values of the hypothesis tests on this dataset for the F1-measure, in Table 7, confirm this: the p-values of GMVSAE and iBOAT versus Transformer in all scenarios are higher than the significance level of 0.05, meaning there is no statistical difference in terms of F1-measure between our approach and them. Regarding RioBusData and STOD, however, the p-values are lower than 0.05 in two of the three anomaly cases. Looking at precision values, Transformer achieves the best overall results but lower recall than GMVSAE and iBOAT. In practice, better precision can be an advantage since an anomaly detection model can be considered a filter that identifies the possibly few anomalous trajectories in a large set of trajectories. The more precise this filter is, the fewer false positives need to be inspected.

Overall, the methods built to detect anomalies directly (Transformer, GMVSAE, and iBOAT) outperformed, in almost all scenarios, the ones that do so indirectly (STOD and RioBusData), i.e., by learning to classify routes instead of anomalies. For example, for \(p=0.1\), RioBusData obtained the lowest F1 on both Dublin (0.651) and Recife (0.533). However, on the Dublin dataset, STOD shows better F1 results than iBOAT for \(p=0.2\) (0.69 vs 0.673, respectively) and, for \(p=0.3\), obtained the second-best F1 value (0.724), behind only Transformer (0.854). We also observed that STOD is more sensitive than our method to the outlier percentage p. For example, for \(p=0.1\) and \(p=0.3\), its F1-measure on Recife is 0.58 and 0.66, respectively (a difference of 0.08). In contrast, our method is more stable regarding the outlier level. For instance, for \(p=0.1\) and \(p=0.3\), the F1-measure is respectively 0.84 and 0.85 on Dublin and 0.66 and 0.67 on Recife, i.e., there is not much difference.

Fig. 2.
figure 2

PR-AUC of the approaches for route 1 on the Dublin dataset and route 54 on the Recife dataset.

Table 5. Results for the region anomaly detection models.

To provide a detailed analysis of the approaches on individual routes, Fig. 2 shows the PR-AUC curves of all methods on two different routes, one from each dataset, with \(p=0.3\). On route 54 from Recife, our approach has the highest area under the curve (\(\approx 0.98\)), outperforming both the unsupervised methods (GMVSAE \(\approx 0.94\) and iBOAT \(\approx 0.77\)) and the supervised ones (STOD \(\approx 0.54\) and RioBusData \(\approx 0.67\)). Looking at route 1 from Dublin, we observe that the encoder-decoder methods have almost perfect curves (AUC \(\approx 0.99\)), i.e., these models can adequately distinguish anomalous trajectories from non-anomalous ones. Conversely, RioBusData has the worst PR-AUC curve (\(\approx 0.70\)). Finally, we can see that the precision of iBOAT degrades as recall approaches 1.

Table 6. Hypothesis test for F1 on the Dublin dataset.
Table 7. Hypothesis test for F1 on the Recife dataset.

6.2 Region Anomaly Detection

Table 5 shows the results of Transformer and iBOAT on region anomaly detection. We observe that Transformer outperforms iBOAT on the Recife dataset in all scenarios. This occurs mainly because our model achieves high precision values (0.988, 0.975, 0.961). The methods are, however, similar regarding recall: for \(p=0.1\), for instance, Transformer’s recall is 0.992 and iBOAT’s is 0.990. On the Dublin dataset, the methods are qualitatively similar. Note that the largest difference between the models occurs for \(p=0.1\): Transformer’s F1 is 0.986, and iBOAT’s F1 is 0.978.

We applied the Wilcoxon test to verify whether there is a statistical difference between the F1-measure values of the methods on this task. On the Recife dataset, the models are statistically different, with p-values lower than our significance level of 0.05 in all scenarios. On the Dublin dataset, however, there is no statistical evidence to reject \(h_0\), since the p-values are higher than 0.05; therefore, both models are statistically equivalent in terms of F1-measure.

To present concrete examples of the detection of region anomalies on real trajectories by our model, Fig. 3a shows the expected trajectories of Dublin route 1, and Fig. 3b depicts the anomalous regions identified (represented by the red dots) by Transformer. One can see from these plots that our approach identifies the anomalous regions in this trajectory very precisely.

Fig. 3.
figure 3

Example of anomaly detection inference.

7 Conclusion

In this paper, we propose a solution that applies an encoder-decoder Transformer language model to bus trajectory data to solve two problems: trajectory and sub-trajectory anomaly detection. Our solution transforms a trajectory into a discrete token sequence by mapping its points to tokens representing geographical grid cells. This sequence is then passed to the Transformer language model, which outputs a predicted sequence, supposedly without anomalies. Finally, our solution calculates the trajectory’s anomaly score by applying the Hamming distance between the two sequences and identifies the anomalous regions by looking at the unmatched tokens between them. Experiments on two real-world bus trajectory datasets demonstrate that our approach is effective for both the anomalous trajectory detection and anomalous region detection tasks.

In future work, we intend to train our approach on multiple trajectory datasets to verify whether it can learn general trajectory patterns (a deep representation). Once our approach learns those patterns, we want to explore other tasks, such as trajectory similarity and classification, using transfer learning.