title: Identifying Near-Miss Traffic Incidents in Event Recorder Data authors: Yamamoto, Shuhei; Kurashima, Takeshi; Toda, Hiroyuki date: 2020-04-17 journal: Advances in Knowledge Discovery and Data Mining DOI: 10.1007/978-3-030-47436-2_54

Front video and sensor data captured by vehicle-mounted event recorders are used not only as traffic accident evidence but also for safe-driving education in the form of near-miss traffic incident data. However, most event recorder (ER) data shows only regular driving events. To utilize near-miss data for safe-driving education, we need to be able to easily and rapidly locate the appropriate data within large amounts of ER data through labels attached to the scenes/events of interest. This paper proposes a method that can automatically identify near-misses with objects such as pedestrians and bicycles by processing the ER data. The proposed method extracts two deep feature representations that consider car status and the environment surrounding the car. The first feature representation is generated by considering the temporal transitions of car status. The second one captures the positional relationship between the car and surrounding objects by processing object detection results. Experiments on actual ER data demonstrate that the proposed method can accurately identify and tag near-miss events.

Recently, the event recorder has become an almost obligatory car accessory. Modern recorders can capture front video, several sensor streams, and driving operations. The event recorder permanently stores the data from the dozens of seconds on either side of a trigger, namely longitudinal/lateral acceleration/deceleration exceeding a certain level. In this paper, we call such data event recorder (ER) data. ER data is being effectively used as traffic accident/violation evidence. In addition, ER data that captures near-miss traffic incidents ("near-misses"), such as near collisions between the car and other obstacles, is being considered for use in reducing traffic accidents. Actual examples of near-miss scenes captured by ERs are shown in Fig. 1. The ER data of near-misses is best utilized in pro-active education that targets safer driving. An example of safe-driving education is to have drivers watch actual ER footage of near-miss traffic incidents [17]. In addition, near-miss incidents in ER data are attracting the attention of fleet management companies that need to control scores of commercial motor vehicles such as vans and trucks. For example, car leasing and commercial trucking companies can evaluate each driver's skills by processing the front video captured by Internet-connected cameras. A car insurance company is detecting dangerous areas in town and creating hazard maps based on traffic accidents or near-misses found in ER data. As just described, various services and applications are using the near-miss events present in ER data; they represent new opportunities for eliminating or minimizing the risks associated with vehicle operation. However, most ER data doesn't include near-miss incidents ("no near-miss"). One report [6] claimed that about 70% of ER data contains no near-miss incident. This is because the acceleration limits used to trigger the ER can be exceeded by rough roads and abrupt driving inputs. Moreover, actual safe-driving education organizers expect the ER data to be tagged and sorted according to the type of incident (e.g. pedestrian and bicycle) because they want to extract the best possible videos as safe-driving education material for each incident type.
Unfortunately, manually identifying and labelling all near-miss incidents in the large amount of ER data available is too time consuming, expensive, and error prone. Therefore, automating the process is essential to reducing the cost of safe-driving education and strengthening the effective use of ER data. The objective of this paper is to automatically detect the presence of near-miss incidents and then accurately identify the near-miss type. To achieve this objective, the straightforward approach is to build a multi-class classification model. ER data is multi-modal data consisting of video and sensor readings, and it is considered necessary to use all the data in combination to identify near-miss incidents. The state of the vehicle itself and of its surroundings is mainly determined from sensor readings and video; both are key pieces of information for determining whether an ER data segment contains a near-miss or not. Thanks to advances in deep neural networks (DNNs), we can now handle such data with convolutional neural networks (CNNs) [3] as well as recurrent neural networks (RNNs) [8]. Passing the image frames through a CNN yields feature vectors, the feature vectors of the image frames and sensor streams can be integrated by a fully connected neural network, and the resulting time-series data can be modelled by an RNN (a minimal sketch of this straightforward baseline is given at the end of this section). Although this approach can detect near-miss incidents (i.e., determine the presence or absence of a near-miss event), it is not accurate in classifying incidents according to their type. There are two reasons for this failure. Issue 1: The near-miss detection task doesn't require detailed information about the obstacle captured by the front video, because it is sufficient that just some kind of obstacle is detected; a CNN that extracts basic visual features suffices. However, the task of classifying near-miss incidents requires an understanding of the kind of object and its positional relation to the car, and simple CNNs can't extract visual features with sufficient detail. Issue 2: The task of identifying near-miss incidents can be treated as a two-level hierarchical classification task. First, each ER segment is classified as near-miss or no near-miss. Second, the near-miss object in each ER segment is identified. However, general multi-class classification frameworks don't provide such a hierarchical architecture and instead attempt to solve the two classification tasks simultaneously (i.e., they treat the task as a one-level classification task). This makes the task more complex, which degrades classification accuracy. To resolve these two issues, this paper proposes a classification method that combines a supervised DNN that processes object detection results with multi-task learning. The proposed method has three main components. The first component, the Temporal Encoding Layer, generates a feature vector by encoding frame images, sensor streams, and object detection results as time-series data. The second component, the Grid Embedding Layer, creates a feature vector by embedding object detection results into a grid space that captures the position of each object relative to the car. The third component, the Multi-task Layer, splits the main task into two sub-tasks to classify the near-miss type. We conduct experiments on an actual ER dataset to evaluate the effectiveness of the proposal. Our results show that the proposed method handles ER data well, with improved performance.
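As a concrete illustration, the following is a minimal PyTorch sketch of this straightforward baseline, not the authors' implementation: it assumes that per-frame CNN image features are precomputed, and the layer sizes, the choice of an LSTM, and all variable names are illustrative.

import torch
import torch.nn as nn

# Straightforward baseline: per-frame CNN features and sensor readings are
# fused by a fully connected layer, the fused frame features are modelled as a
# time series by an RNN, and the last hidden state feeds a classifier over the
# C label types (the near-miss types plus "no near-miss").
class StraightforwardBaseline(nn.Module):
    def __init__(self, img_dim=1024, sen_dim=3, hidden=256, num_classes=6):
        super().__init__()
        self.fuse = nn.Linear(img_dim + sen_dim, hidden)       # frame-level fusion
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)   # temporal modelling
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, img_feats, sensors):
        # img_feats: (batch, T, img_dim) CNN features of the frame images
        # sensors:   (batch, T, sen_dim) normalized sensor readings per frame
        x = torch.relu(self.fuse(torch.cat([img_feats, sensors], dim=-1)))
        out, _ = self.rnn(x)                   # (batch, T, hidden)
        return self.classifier(out[:, -1])     # class logits from the last time step

In practice, the per-frame features would come from a pretrained image backbone applied to each frame before this module is called.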
Several studies have focused on near-miss traffic incident detection (i.e., determining the presence or absence of a near-miss event) from dashboard camera (dashcam) data. Suzuki et al. [15] estimate the risk level of each frame image in front video by using a CNN, a highly effective DNN architecture. Their model demonstrated improved accuracy in near-miss detection by introducing a pedestrian detection task as a sub-function. While their model detects near-miss scenes in front video, it does not consider the classification of near-miss incidents. Dashcam data has been used for various tasks other than near-miss detection. By extracting driver operations from dashcam data, Yokoyama et al. [20] use feature engineering to detect drivers with dangerous driving styles. Front video is also a significant part of autonomous driving technology. To permit autonomous control of vehicle movement, Jain et al. [5] predict driving maneuvers such as going straight, left/right turns, lane changes, and stops based on front video from in-vehicle cameras; their prediction model analyzes the features of the driver's face. Our work differs from theirs in both goal and proposed model. Our approach is motivated by the success achieved by using DNNs to analyze video data. The DNN components of CNNs and RNNs are widely used for human activity recognition. Sharma et al. [12] introduced a DNN-based visual attention mechanism for extracting characteristic regions in each frame image; they used it to encode feature vectors extracted by a CNN. Simonyan et al. [13] proposed a spatio-temporal approach that uses both optical flow and normal images with the intention of capturing the movements of objects present in videos. Our experiments, presented in Sect. 5, evaluate the effectiveness of these human activity recognition schemes for identifying near-miss incidents.

Data Format: Each ER segment consists of a sequence of frame images combined with the data streams output by several sensors. The sequence length is taken to be the number of frames in the ER sequence, T. The sensor data at each timestep is a vector with several dimensions such as longitudinal/lateral acceleration and speed. We normalize the sensor data in each dimension to N(0, 1) because the dimensions have different value scales.

Object Detection: To correctly identify the near-miss type, our approach uses the object detection results of the images {I_t}_{t=1}^{T}. For this we employ YOLO [10], one of the most effective DNN-based object detection algorithms. The object detection result of image I_t consists of N_t objects. Each detected object n is described by the triple {o_{t,n}, l_{t,n}, p_{t,n}}. The one-hot vector o_{t,n} = {o_{t,n,v}}_{v=1}^{V} encodes the object type, where V is the number of object types; the bounding box vector l_{t,n} = {x^{left}_{t,n}, y^{top}_{t,n}, x^{right}_{t,n}, y^{bot}_{t,n}} specifies the object's coordinates (left, top, right, and bottom) in the image; and the detection probability p_{t,n} gives the confidence of the detection.

Annotation Label: Since we apply supervised machine learning, each ER sequence is assumed to come with the correct near-miss target label y_m ∈ R^C, a one-hot vector over the C label types. We extract two additional kinds of correct labels by re-organizing the near-miss target label y_m. The first additional label, y_{s1}, identifies near-miss (y_{s1} = 1) or no near-miss (y_{s1} = 0). The second, the one-hot vector y_{s2} ∈ R^{C-1}, identifies the near-miss type of each ER sequence other than those labelled no near-miss.
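To make this label re-organization concrete, the following small Python sketch derives y_s1 and y_s2 from y_m. It assumes, purely for illustration, that index 0 of y_m encodes the no near-miss class; the actual index mapping is not specified here.

import numpy as np

def reorganize_labels(y_m):
    # y_m: one-hot vector of length C (the near-miss target label).
    # Assumption: y_m[0] corresponds to "no near-miss".
    y_s1 = 0 if y_m[0] == 1 else 1        # sub-task 1: near-miss present (1) or not (0)
    y_s2 = np.asarray(y_m[1:])            # sub-task 2: one-hot over the C-1 near-miss types
    return y_s1, y_s2                     # y_s2 is only meaningful when y_s1 = 1

y_s1, y_s2 = reorganize_labels(np.array([0, 0, 1, 0, 0, 0]))  # e.g. the second near-miss type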
The proposed method is composed of three main components (Fig. 2). We describe these components in Sects. 4.1, 4.2, and 4.3.

Temporal Encoding Layer (TEL): The objective of this layer is to generate a feature vector by considering the temporal transitions present in the time-series data.

Image Encoder: To obtain holistic features, such as the surrounding environment, from the front video, we encode each video image into a feature vector using a CNN. To extract visual features from each image, we prepare two GoogLeNets [16], pretrained on ImageNet [11] and Places365 [21], respectively. These GoogLeNets encode each image I_t into two feature vectors. Next, these feature vectors are encoded by a fully connected neural network (FC) into a feature vector of dimension U. The feature vector extracted by this process for frame number t is denoted h^{img}_t.

Sensor Encoder: To obtain features that describe the car status, we use an FC to encode the sensor data into a feature vector with U dimensions. The feature vector extracted by this process for frame number t is denoted h^{sen}_t.

Object Encoder: To extract detailed features, such as obstacles and traffic signs present in the front video, we use the object detection results after translating them into a simple vector representation. We focus on the appearance degree of each object and generate the vector e_t, whose dimension equals the number of object types V. The score is calculated by e_t = Σ_{n=1}^{N_t} o_{t,n} · p_{t,n}. If several objects of the same type are detected in an image, their detection probabilities p_{t,n} are summed, which enhances the appearance degree of that object type. Next, the generated vector e_t is encoded into a feature vector with U dimensions by an FC; this yields, for frame number t, h^{obj}_t. The frame-level feature vectors h^{img}_t, h^{sen}_t, and h^{obj}_t are then integrated and modelled as time-series data by an RNN with temporal soft attention weights α^τ_t, which yields the output of this layer, the temporal feature vector h^{te}.

Grid Embedding Layer (GEL): The objective of this layer is to derive a feature vector that can be used to identify near-miss targets; it does so by considering the bounding box information of each object in each frame image. In this paper, we propose a grid embedding method that utilizes the bounding box information; we focus on the position of each object in the image and consider the positional relationship between the car and each object. The method prepares a grid space G ∈ R^{G_h × G_w × V} by setting appropriate vertical and horizontal grid dimensions (G_h and G_w) and then embeds the objects into the grid space G. The embedded grid feature matrix G is generated by Algorithm 1 below; an example of the grid embedding flow is shown in Fig. 3. As the embedding score for each cell, we employ the 2D area ratio r because we prioritize the distance between the car and each object. We consider that the area ratio can represent the distance between the car and each object in the image, since the area ratio of an object shrinks as its distance from the car grows, i.e., objects close to the car have larger area ratios than distant objects. The above process yields the grid features g_{i,j}.

Algorithm 1 (Grid embedding):
  Input: {{o_{t,n}}_{n=1}^{N_t}}_{t=1}^{T}, {{l_{t,n}}_{n=1}^{N_t}}_{t=1}^{T}, H, W, G_h, G_w
  Output: G
  Initialize: G ∈ R^{G_h × G_w × V} ← 0; (S_w, S_h) ← (W/G_w, H/G_h)
  for t = 1 to T do
    for n = 1 to N_t do
      (x1, y1, x2, y2) ← (x^{left}_{t,n}/S_w, y^{top}_{t,n}/S_h, x^{right}_{t,n}/S_w, y^{bot}_{t,n}/S_h)
      r ← {(x^{right}_{t,n} − x^{left}_{t,n}) × (y^{bot}_{t,n} − y^{top}_{t,n})}/(W × H)
      for i = y1 to y2 do
        for j = x1 to x2 do
          g_{i,j} ← g_{i,j} + r · o_{t,n}
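Below is a NumPy sketch of the grid embedding in Algorithm 1. The rounding of the cell indices and the completion of the truncated inner loops follow our reading of the algorithm and the surrounding text, and should be treated as assumptions rather than the authors' exact procedure.

import numpy as np

def grid_embedding(objects, boxes, H, W, G_h, G_w, V):
    # objects: list over T frames of (N_t, V) one-hot object-type vectors o_{t,n}
    # boxes:   list over T frames of (N_t, 4) boxes (left, top, right, bottom) in pixels
    G = np.zeros((G_h, G_w, V))
    S_w, S_h = W / G_w, H / G_h                                   # grid cell size
    for o_t, l_t in zip(objects, boxes):
        for o, (left, top, right, bot) in zip(o_t, l_t):
            x1, y1 = int(left // S_w), int(top // S_h)            # first covered cell
            x2 = int(np.ceil(right / S_w))                        # last covered cell (exclusive)
            y2 = int(np.ceil(bot / S_h))
            r = ((right - left) * (bot - top)) / (W * H)          # 2D area ratio of the object
            v = int(np.argmax(o))                                 # detected object type
            G[y1:y2, x1:x2, v] += r                               # embed the score into covered cells
    return G

Calling grid_embedding once per ER sequence yields the G_h × G_w × V grid feature tensor whose cells g_{i,j} are then pooled by the soft attention described next.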
Not all cells are important for the task of identifying near-miss incidents, because the captured image strongly depends on the mounting position of the ER. For example, as shown in Fig. 1, the car's bonnet occupies significantly different parts of the image when the ER's direction and position change. Moreover, cells covered by the bonnet don't contribute to achieving our goal because objects will not appear there. Therefore, it is not appropriate to treat all grid cells as equally important. In this paper, we employ a soft attention mechanism to calculate the feature vector h^{gr} as follows:

α^g_{i,j} = exp(w_g^T tanh(W_g g_{i,j} + b_g)) / Σ_{i',j'} exp(w_g^T tanh(W_g g_{i',j'} + b_g)),
h^{gr} = Σ_{i=1}^{G_h} Σ_{j=1}^{G_w} α^g_{i,j} g_{i,j},

where W_g, b_g, and w_g are the DNN model parameters. These formulas mean that the attention weight α^g_{i,j} is dynamically estimated from the grid feature g_{i,j} as the grid importance, and the feature vector h^{gr} is calculated from the attention weights and grid features.

Multi-task Layer: The objective of this layer is to identify the near-miss target using the two feature vectors obtained in Sects. 4.1 and 4.2, respectively. First, we concatenate the feature vectors: h^{tg} = [h^{te}; h^{gr}]. We then utilize a multi-task learning framework by setting two simple sub-tasks derived from the main task. The first sub-task determines the presence or absence of a near-miss event in each ER sequence. We encode h^{tg} into the scalar value ŷ_{s1}, the output of this sub-task, by an FC and a sigmoid function, and calculate the cross entropy error L_{s1} between the correct label y_{s1} and the result ŷ_{s1} as follows:

L_{s1} = − Σ_{d=1}^{D} { y_{s1} log ŷ_{s1} + (1 − y_{s1}) log(1 − ŷ_{s1}) },

where D and d are the number of training data and the index used in scanning the training data, respectively; d links y_{s1} and ŷ_{s1}, but is omitted from the notation in this paper. The second sub-task identifies the near-miss type of each ER sequence other than those identified as no near-miss. We encode h^{tg} into the vector ŷ_{s2}, the result of this sub-task, by an FC and a softmax function, and then calculate the cross entropy error L_{s2} between the correct label y_{s2} and the result ŷ_{s2} as follows:

L_{s2} = − Σ_{d=1}^{D} Σ_{k=1}^{C−1} y_{s2,k} log ŷ_{s2,k}.

We then concatenate the results into h = [h^{tg}; ŷ_{s1}; ŷ_{s2}], so that the main task can consider the outputs of these simple sub-tasks. We encode h into the vector ŷ_m, which represents the result of the main task, by an FC and a softmax function, and calculate the cross entropy error L_m between the correct label y_m and the result ŷ_m as follows:

L_m = − Σ_{d=1}^{D} Σ_{c=1}^{C} y_{m,c} log ŷ_{m,c}.

We optimize the objective function L = L_m + β · (L_{s1} + L_{s2}), which includes the errors of all three tasks; β is a hyper-parameter that controls the weight of the sub-task errors. The label output by the main task is obtained by taking the index with the maximum score in ŷ_m.

The general aim of multi-task learning is to leverage the useful information contained in multiple related tasks to improve generalization performance. Learning multiple tasks jointly can lead to significant performance improvements compared with learning them individually, as shown by several related works [1, 2]. For example, [2] jointly learns representations of words, entities, and meanings via multi-task learning, and [1] shows the effectiveness of this approach for various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. Following the success of these multi-task learning approaches, the innovation of our multi-task learning is to learn a classifier specific to each sub-task; we extract effective features and obtain new feature vectors for performing each sub-task, as illustrated in the sketch below.
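The following PyTorch sketch illustrates the Multi-task Layer under the cross-entropy reconstruction given above. The layer names and dimensions are illustrative, and the handling of y_s2 for no near-miss sequences (here simply computed over all sequences) is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskLayer(nn.Module):
    # Two sub-task heads over h_tg; their outputs are concatenated back with h_tg
    # for the main head. Combined objective: L = L_m + beta * (L_s1 + L_s2).
    def __init__(self, feat_dim=512, num_classes=6, beta=0.5):
        super().__init__()
        self.beta = beta
        self.sub1 = nn.Linear(feat_dim, 1)                 # sub-task 1: near-miss or not
        self.sub2 = nn.Linear(feat_dim, num_classes - 1)   # sub-task 2: near-miss type
        self.main = nn.Linear(feat_dim + num_classes, num_classes)

    def forward(self, h_tg, y_s1, y_s2, y_m):
        y_hat_s1 = torch.sigmoid(self.sub1(h_tg)).squeeze(-1)   # (batch,)
        logit_s2 = self.sub2(h_tg)
        y_hat_s2 = torch.softmax(logit_s2, dim=-1)               # (batch, C-1)
        h = torch.cat([h_tg, y_hat_s1.unsqueeze(-1), y_hat_s2], dim=-1)
        logit_m = self.main(h)                                   # main-task logits
        loss_s1 = F.binary_cross_entropy(y_hat_s1, y_s1.float())
        loss_s2 = F.cross_entropy(logit_s2, y_s2)   # y_s2: near-miss type index per sequence
        loss_m = F.cross_entropy(logit_m, y_m)      # y_m: main-task label index per sequence
        loss = loss_m + self.beta * (loss_s1 + loss_s2)
        return logit_m, loss                        # predicted label = argmax of logit_m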
All features, including frame images, sensor streams, and object detection results, can be useful for determining whether a video contains a near-miss incident or not. However, once we know that a video contains a near-miss, the object detection results constitute the most helpful information for determining the near-miss target, because they allow us to understand the kinds of objects around the car. While single-task learning must learn such features implicitly, our multi-task learning can learn them explicitly as isolated features.

The experimental evaluation uses the Near-Miss Incident Database provided by the Smart Mobility Research Center of Tokyo University of Agriculture and Technology in Japan. The dataset is a collection of data captured by ERs mounted in Japanese taxis. Each ER data sequence is 15 s long: 10 s before the trigger and 5 s after it. Each sequence was manually assigned one of five risk levels {high, middle, low, bit, no near-miss} and one of six near-miss incident types {car, bicycle, motorcycle, pedestrian, self, other} by experts. The experiment focused on five near-miss incident types {car, bicycle, motorcycle, pedestrian, self}. 700 sequences were randomly extracted for each near-miss incident type, and 700 sequences tagged no near-miss were also randomly extracted. Therefore, the experiment examined 4,200 sequences with 6 labels (C = 6). We randomly split the dataset into 2,940 (70%) sequences as training data and 1,260 (30%) as test data. Each sequence was recorded at 30 frames per second and so consisted of 450 frames; we sampled T = 30 frames at intervals of 15 frames. Each image had a resolution of W = 640 and H = 400 in RGB format. The original images were processed by YOLO for object detection, which yielded V = 69 object types. For visual feature extraction, linearly resized images (224 × 224 pixels) were processed by the two GoogLeNets. For the sensor data, we extracted three sensor streams: speed and longitudinal/lateral acceleration. For the DNN in the proposed method, we set the number of hidden units in each FC to U = 256, and the output of each FC is non-linearly transformed by the ReLU function [9] and regularized with Dropout (p = 0.7) [14].

To examine the effectiveness of the proposed method in identifying near-miss incidents, we use three evaluation metrics: precision, recall, and F1-score [7] (a small sketch of the evaluation procedure is given after the baseline descriptions below). We show the classification performance of the proposed method and four baseline methods in Table 1. The baselines are as follows.
DNN: The straightforward approach using a DNN (i.e., TEL without object detection results).
SVM: An SVM using the three information sources (video, sensor, and object). To use an ER sequence as SVM input, we transformed each information source into a vector space and concatenated the vectors over all frames.
IDT [18]: This method was proposed for recognizing human activity in video and is one of the state-of-the-art (SOTA) methods for extracting video features; it identifies several visual key points and uses their trajectories to characterize each video. Each video is then converted into a K-dimensional feature vector by K-means clustering over all videos (we set K to 200). We use the IDT-based features to train an SVM classifier.
ST-CNN [13]: This method was proposed for recognizing human activity in video and is another SOTA method. It combines two types of CNNs: a spatial CNN that captures the scenes and objects depicted in the video, and a temporal CNN that captures the motion between frames. ST-CNN calculates average scores from these two feature vectors.
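As a sketch of the evaluation procedure, assuming scikit-learn and SciPy (tools not specified in the paper), the following computes per-class precision, recall, and F1-score, and the χ² test on the correct/incorrect cross tabulation used for significance testing in the results below; the label ordering is illustrative.

import numpy as np
from sklearn.metrics import classification_report
from scipy.stats import chi2_contingency

LABELS = ["No near-miss", "Car", "Bicycle", "Motorcycle", "Pedestrian", "Self"]

def evaluate(y_true, y_pred):
    # Per-class precision, recall, and F1-score over the six labels.
    print(classification_report(y_true, y_pred, target_names=LABELS, digits=3))

def significance(correct_a, correct_b):
    # correct_a / correct_b: boolean arrays marking, per test sequence, whether
    # each method's prediction is correct; the chi-squared test is run on the
    # resulting 2x2 cross tabulation of correct/incorrect counts.
    table = np.array([[np.sum(correct_a), np.sum(~correct_a)],
                      [np.sum(correct_b), np.sum(~correct_b)]])
    chi2, p, _, _ = chi2_contingency(table)
    return p   # p < 0.01 indicates a significant difference between the methods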
In Table 1, "N", "C", "B", "M", "P", and "S" stand for "No near-miss", "Car", "Bicycle", "Motorcycle", "Pedestrian", and "Self", respectively. For all evaluation metrics, the proposed method achieved the highest values among the compared methods. The results indicate the effectiveness of our approach for the near-miss traffic incident identification task on ER data. Note that we conducted the χ² test based on a cross tabulation (the joint frequency distribution of test cases) over two categorical variables (i.e., proposed and each baseline), where each variable can be correct or incorrect. The results confirmed that the proposed method is significantly better than the baselines (p-value < 0.01).

Figure 4 shows the detailed classification results of Proposed, SVM, and ST-CNN as confusion matrices. In this figure, the true labels and predicted labels are plotted on the horizontal and vertical axes, respectively, and the number in each cell is the number of test sequences with that label pair. The proposed method correctly identified more sequences than the other two methods for all labels except no near-miss and self, which confirms its superior performance.

The proposed method uses soft attention for temporal and grid-space processing in TEL and GEL, respectively. By calculating the mean values of the soft attention weights α^τ_t and α^g_{i,j} for each correct label in the test data, we can compare the times and spatial regions emphasized by the proposed method. The mean attention scores α^τ_t calculated for each correct label are shown in Fig. 5; the vertical and horizontal axes are the attention scores α^τ_t averaged over the test data and the frame number t, respectively. The trigger corresponds to frame number t = 20. The scores of the near-miss targets car, bicycle, motorcycle, and pedestrian peaked at around frame number t = 25. In contrast, the self label attained its highest attention score toward the last frame, t = 30, while no near-miss peaked at frame number t = 21. As these results demonstrate, the self and no near-miss labels have different characteristics from the other labels, whereas the four other labels show a similar tendency in terms of α^τ_t.

The mean attention scores α^g_{i,j} calculated for each correct label are shown in Fig. 6, where the color intensity represents the mean attention score of each cell. Cells on the left side of all panels have higher scores than those on the right; we think this is because vehicles and bicycles drive on the left side of the road in Japan. Cells in the lower center region have lower values, as this region is often occupied by the car's bonnet. The pedestrian label has high attention scores in the vertical column of center cells, which suggests that pedestrians frequently appeared in this region. We consider that GEL contributes to the improvement of estimation performance by accounting for grid importance when processing ER data.

This paper proposed a classification method that coherently utilizes the data provided by front video, sensor streams, and object detection results to accurately label near-miss events in the data captured by ERs (dashcams). The proposed method has three components: the Temporal Encoding Layer, feature encoding for multi-modal time-series data; the Grid Embedding Layer, feature embedding that places detected objects into a grid space set relative to the vehicle; and the Multi-task Layer, multi-task learning utilizing sub-tasks derived from the main task.
An experiment using actual ER data confirmed the performance improvements attained by the proposed components. We intend to develop a semi-supervised model to handle small amounts of training data and to extend the model to support the multi-labeling of events.

References
[1] Joint learning of words and meaning representations for open-text semantic parsing
[2] Natural language processing (almost) from scratch
[3] Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position
[4] Long short-term memory
[5] Recurrent neural networks for driver activity anticipation via sensory-fusion architecture
[6] Research on incident analysis using drive recorder part 1: toward database construction
[7] N-gram similarity and distance
[8] Recurrent neural network based language model
[9] Rectified linear units improve restricted Boltzmann machines
[10] YOLO9000: better, faster, stronger
[11] ImageNet large scale visual recognition challenge
[12] Action recognition using visual attention
[13] Two-stream convolutional networks for action recognition in videos
[14] Dropout: a simple way to prevent neural networks from overfitting
[15] Pedestrian near-miss analysis on vehicle-mounted driving recorders
[16] Going deeper with convolutions
[17] Elderly driver retraining using automatic evaluation system of safe driving skill
[18] Action recognition with improved trajectories
[19] Hierarchical attention networks for document classification
[20] Understanding drivers' safety by fusing large scale vehicle recorder dataset and heterogeneous circumstantial data
[21] Places: a 10 million image database for scene recognition