title: SegCodeNet: Color-Coded Segmentation Masks for Activity Detection from Wearable Cameras authors: Sushmit, Asif Shahriyar; Ghosh, Partho; Istiak, Md. Abrar; Rashid, Nayeeb; Akash, Ahsan Habib; Hasan, Taufiq date: 2020-08-19

Activity detection from first-person videos (FPV) captured using a wearable camera is an active research field with potential applications in many sectors, including healthcare, law enforcement, and rehabilitation. State-of-the-art methods use optical flow-based hybrid techniques that rely on features derived from the motion of objects across consecutive frames. In this work, we develop a two-stream network, SegCodeNet, that uses a network branch containing video streams with color-coded semantic segmentation masks of relevant objects in addition to the original RGB video stream. We also include a stream-wise attention gate that prioritizes between the two streams and a frame-wise attention module that prioritizes the video frames that contain relevant features. Experiments are conducted on an FPV dataset containing 18 activity classes in office environments. In comparison to a single-stream network, the proposed two-stream method achieves absolute improvements of 14.366% and 10.324% in averaged F1 score and accuracy, respectively, when results are averaged over three different frame sizes, 224×224, 112×112, and 64×64. The proposed method provides significant performance gains for lower-resolution images, with absolute improvements of 17% and 26% in F1 score for input dimensions of 112×112 and 64×64, respectively. The best performance is achieved for a frame size of 224×224, yielding an F1 score and accuracy of 90.176% and 90.799%, which outperforms the state-of-the-art Inflated 3D ConvNet (I3D) [1] method by absolute margins of 4.529% and 2.419%, respectively.

The widespread availability of body-worn cameras enables a multitude of new applications in daily life using first-person videos. With the large quantity of video data becoming available, automatic processing of FPVs has become a topic of growing interest for human activity recognition (HAR). Various application domains of HAR include activity logs, law enforcement, healthcare, search and rescue missions, inspections, home-based rehabilitation, sporting activity observation [2] and wildlife observation [3]. In the wake of the COVID-19 pandemic, activity recognition from videos can become vital in detecting whether users are adhering to social distancing and hand hygiene guidelines. These applications can be of significance as governments begin to resume regular activities.

A vast amount of research work has already been done in the area of HAR. The research topic of HAR can be broadly classified into three different streams [4]: (i) radio frequency-based, (ii) sensor-based [5], and (iii) video-based. The focus of the current work is on videos, which can again be categorized into two types, namely third-person video (TPV) and first-person video (e.g., recorded by a wearable camera). Some of the most popular datasets available for human activity recognition are UCF101 [6], HMDB-51 [7], THUMOS [8] and the Kinetics datasets [9], [10], [11], all of which mostly contain third-person videos of people performing different tasks.
The first publicly available dataset consisting of FPV for activity recognition was collected in a controlled office setting [12]. Other available FPV datasets include CMU-MMAC [13], GTEA-11 [14], VNIST [15], and HUJI EgoSeg [2], [16]. These datasets offer videos from head-, shoulder-, and chest-mounted wearable cameras. For both FPV and TPV domains, the major challenges in automatic HAR include scale and texture variation, low resolution, motion blur, illumination changes, context analysis, and self-occlusions [17], [18]. FPV, on the other hand, poses some additional challenges due to a more dynamic background, scarcity of information, and an unstable perspective.

The existing literature on activity recognition using image and video processing techniques can be broadly classified into methods based on traditional features [19]-[22] and neural networks. Traditional methods mainly focus on designing useful features to be extracted from the video frames and classified by machine learning methods. The unimodal (i.e., single-stream) methods for human activity classification can be divided into four categories [23]: space-time methods, rule-based methods, shape-based methods, and stochastic methods. Methods involving a neural network can be further categorized into four groups based on: (i) single RGB stream videos, (ii) optical flow, (iii) pose estimation, and (iv) a hybrid approach. RGB stream based techniques [24], [25] generally include a feature extractor that inputs the original video frame pixels and a recurrent neural network for classification. Popular feature extractors include variations of AlexNet [26], ResNet [27], Wide ResNet [28], ResNeXt [29], DenseNet [30] and Xception [31]. For the recurrent network used for classification, one or more uni/bi-directional LSTM (Long Short-Term Memory) layers are used [32]. Pose estimation based methods determine the position and orientation of the different limbs of the humans present in the field of view [25], [33]. Hybrid methods [34], [35] augment the feature extraction process by providing additional information via network branches. Notable methods include additional features extracted using eye and ego motion [36], pose estimation [37], hand segmentation [38] and optical flow [39], [40]. The optical flow based methods [1], [41] are currently the state of the art for popular activity classification datasets such as UCF-101 [6] and HMDB-51 [7]. Optical flow computation from successive video frames creates binary or color masks that attribute different contrast/color values to each pixel that changes temporally. However, for non-stationary scenes, optical flow-based systems suffer from noisy features and classification errors due to a rapidly changing background. To address this problem, [42] proposes a neural network-based approach that attempts to remove the noise induced in the optical flow due to motion artifacts. However, a fundamental problem with optical flow based methods is the inherent assumption that anything that is moving is important for activity classification, which may not always be correct.

In this work, we propose SegCodeNet, a simple but effective two-stream approach that leverages the information extracted from task-relevant object segmentation masks. Previous research [14] shows that knowing the action improves object recognition performance. Conversely, we hypothesize that knowing the objects should help in classifying the action.
Following the hybrid methods [34], [35], the proposed SegCodeNet includes an additional branch in the network containing segmentation masks of objects relevant for classification. We first generate semantic segmentation masks from the input video frames using a Mask R-CNN network [43]. Instead of providing a collapsed binary segmentation mask, which is not effective for multiple objects [34], we propose a novel color-coded mask where unique colors are attributed to multiple task-relevant objects. This approach simultaneously retains information regarding each object's presence, boundary, and motion that can be utilized by the subsequent feature extractor and classifier network. We expect our approach to be superior to optical flow-based techniques since only task-relevant objects are visible in the segmentation stream, which is not affected by movements in the background. The proposed network thus exploits the interrelation between the objects present in the field of view and their contribution to activity recognition. The weighted and merged features from the masked and the RGB stream enable our model to discover detailed spatio-temporal patterns with enriched semantic information. The presented system also incorporates stream-wise and frame-wise attention gates to ensure the prioritization of the most relevant features.

This paper is organized as follows. Section II describes the dataset used for the study and discusses the challenges involved. In Sec. III, we describe the proposed two-stream architecture and its various modules in detail. Section IV mentions the two state-of-the-art baseline models used for comparison, followed by detailed experimental evaluations in Sec. V. Results are further discussed in Sec. VI before the paper is concluded in Sec. VII.

The dataset used in this work has been provided through the IEEE VIP Cup 2019 competition [44], [45], which consists of first-person videos in office settings. We refer to this dataset as FPV-O for the remainder of this paper. The videos in this dataset were collected using a chest-mounted GoPro Hero3+ camera with a resolution of 1280 × 760 pixels and a 30 fps frame rate [45]. There are four types of human activities present in the FPV-O data: (i) ambulatory motion, (ii) human-to-human interaction, (iii) human-to-object interaction, and (iv) solo activity. Overall, the dataset contains a total of 18 activity classes. A small percentage of the videos in this dataset were recorded outdoors, where the lighting was noticeably different; these videos had a frame rate of 120 fps. A summary of all the classes and the percentage of associated frames is presented in Fig. 2. From the figure, it is evident that significant class imbalance exists in the dataset, which needs to be addressed. The number of video segments in training and testing were 1230 and 568, respectively.

The overall workflow of our architecture is presented in Fig. 3. Different blocks of the proposed activity classification system are described in the following sub-sections. First, we sub-sample the videos in order to reduce the computational load. The sub-sampling process extracts a fixed number of video frames from each video file irrespective of its length. The process is described as follows. Let the n-th frame of a video segment be denoted by V[n], where n ∈ [1, N]. We first compute the average sampling period, τ = ⌊N/k⌋, where ⌊·⌋ denotes the floor operation and k denotes the fixed number of frames to extract.
The first frame of the sub-sampled video is randomly selected as n_1 = Rand(1, τ), where Rand(i, j) selects a random integer from [i, j] (an inclusive set). The subsequent frames are uniformly selected at an interval of τ frames. Therefore, the i-th frame of the sub-sampled video segment can be obtained as V̂[i] = V[n_1 + (i − 1)τ]. This sub-sampling scheme ensures that the information content of the entire video is captured within a fixed number of frames. For the FPV-O dataset, we use k = 40, as we have empirically found that this particular value performs better compared to k = 20 and k ≥ 60. Another value of k may be more suitable for a different dataset.

The mask-based stream in the proposed network acts as a surrogate for the conventional optical flow-based streams used in state-of-the-art systems [1], [41]. First, we manually identify important objects for each activity class in the FPV-O dataset. For instance, "digital screen", "laptop", "paper", and "person" (hand) objects are usually observed in the "Read" activity class. Based on all the target activity classes, we identify an important set of objects and assign a unique color to each of them, as mentioned in Table I. These 14 objects are a subset of the 80 objects available in the COCO dataset [46]. To generate the masks, we first perform instance segmentation on each video frame using the Mask R-CNN network [43]. Next, the class-relevant objects (Table I) are selected from the segmented output, followed by applying the assigned color code to obtain the final segmentation mask. The entire process is summarized in Fig. 4. Different networks pre-trained on the COCO dataset can be used as a feature extractor with a Mask R-CNN model. Popular networks include InceptionV2 [47], ResNet50 [27], ResNet101 [27] and Inception ResNet V2 [47] with Atrous convolution. In our system, we use the Inception ResNet V2 as it is known to provide superior segmentation performance compared to the other networks [48]. Our fundamental assumption is that if these class-relevant objects are segmented and color-coded to generate a separate video stream, the classifier can perform better with the additional information embedded within this stream.

In our proposed two-stream architecture, we use a ResNeXt-50 [29] feature extractor for both of the streams for activity classification in the next stage. Each sub-sampled frame is passed through the feature extractors in the two streams to obtain the corresponding feature vector. The feature vectors extracted from the n-th frame of the RGB and mask-based streams are denoted by F_RGB[n] and F_mask[n], respectively. We note that the proposed architecture is not dependent on a specific feature extractor, and alternative features could also have been used. Our primary motivation is that the original RGB video stream and the color-coded segmentation mask stream will provide complementary information that will eventually help the activity classifier achieve improved performance.

The next stage of our architecture includes a stream-wise attention module that multiplies the extracted feature vectors F_RGB[n] and F_mask[n] by the learnable scalar parameters η_RGB ∈ [0, 1] and η_mask ∈ [0, 1], respectively. The attention values for the RGB and masked streams are independently learned. We presume that for some activity classes the RGB video stream contains more activity-relevant information compared to the masked stream, and vice versa.
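For illustration, a minimal sketch of the frame sub-sampling scheme of Sec. III-A is given below. It follows the reconstruction of the equations above (random start within the first sampling period, then uniform steps of τ); the helper name and 1-based indexing are choices for this example, not the authors' released code.

```python
import random

def subsample_indices(num_frames: int, k: int = 40) -> list[int]:
    """Pick k frame indices (1-indexed) from a video of num_frames frames.

    Assumes num_frames >= k so that all selected indices stay within the video.
    """
    # Average sampling period, tau = floor(N / k).
    tau = max(num_frames // k, 1)
    # Random first frame, n_1 = Rand(1, tau); randint is inclusive on both ends.
    n1 = random.randint(1, tau)
    # Subsequent frames every tau frames: V_hat[i] = V[n_1 + (i - 1) * tau].
    return [n1 + (i - 1) * tau for i in range(1, k + 1)]

# Example: a 1200-frame clip sub-sampled to k = 40 frames.
idx = subsample_indices(1200, k=40)
assert len(idx) == 40 and idx[-1] <= 1200
```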
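To show how the color-coded mask stream can be assembled from instance-segmentation output, the sketch below assumes a Mask R-CNN model that yields per-instance COCO labels and binary masks. The partial color table and the label-to-object mapping (e.g., "tv" standing in for "digital screen") are hypothetical placeholders; Table I defines the actual 14 objects and their colors.

```python
import numpy as np

# Hypothetical subset of the Table I color coding: COCO label -> RGB color.
CLASS_COLORS = {
    "person": (255, 255, 255),
    "laptop": (0, 255, 0),
    "tv": (255, 0, 0),    # stand-in for "digital screen"
    "book": (0, 0, 255),  # stand-in for "paper"
}

def color_coded_mask(frame: np.ndarray, detections) -> np.ndarray:
    """Build the color-coded segmentation frame for one RGB frame.

    `detections` is assumed to be an iterable of (label, binary_mask) pairs
    from a COCO-pretrained Mask R-CNN; each mask matches the frame's size.
    """
    coded = np.zeros_like(frame)  # black background: irrelevant pixels stay empty
    for label, mask in detections:
        color = CLASS_COLORS.get(label)
        if color is None:
            continue  # discard objects that are not class-relevant (not in Table I)
        coded[mask.astype(bool)] = color
    return coded
```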
After the attention layer, the feature vectors extracted from the two streams are concatenated and passed to the bi-directional LSTM layer [32] for temporal feature analysis. The feature vector received by this layer can be denoted as F[n] = F_RGB[n] ⊕ F_mask[n], where ⊕ is the concatenation operator. The bi-directional LSTM module consists of a single hidden layer; during our experiments, we observed that increasing the number of layers did not provide any significant advantage. Naturally, not all the frames sampled from a video contain relevant activity information. Thus, we also use a frame-wise attention module to provide a higher weight to the feature vectors corresponding to the sub-sampled video frames that are more important for classification.

To evaluate the performance of our system, we compare the proposed network with two state-of-the-art architectures for activity classification: the I3D model [1] and the CNN feature-based bi-directional LSTM model proposed by Amin Ullah et al. [24].

The I3D model [1] is an optical flow-based state-of-the-art method achieving one of the top scores on the action classification datasets HMDB-51 [7] and UCF101 [6]. To implement the I3D model on the FPV-O dataset, we utilize the pre-trained weights obtained from the Kinetics dataset [9] and train it with 64 RGB and 64 flow images (computed using the TV-L1 algorithm [49]) as described in [1]. In the two-stream I3D architecture, a 3D CNN is used [50]. In the 3D convolutional layers, an ImageNet pre-trained Inception-V1 model [51] is used as the base network, where the weights of the N × N 2D filters are inflated to N × N × N 3D filters by repeating them N times along the time dimension and dividing by N for re-scaling. This allows the model to capture both spatial and temporal information from the videos.

The second baseline follows [24], where an AlexNet [52] feature extractor is followed by a bi-directional LSTM with two hidden layers; no attention layers are used. In this work, for the sake of comparison with the two-stream network, we implement this baseline by using only the RGB stream part of the proposed network with the ResNeXt feature extractor.

We developed our model using PyTorch. The RAdam [53] optimizer was used for training. The segmentation masks were generated using a pre-trained Mask R-CNN [43] model as described in Sec. III-B. Before feature extraction in the two streams, the image frames obtained from the RGB and mask video streams were resized to 64 × 64, 112 × 112, or 224 × 224 depending on the experiment. We used single-precision images to reduce the computational burden. The RGB and mask feature extractors were trained in two steps. In the first step, the feature extractors were initialized with pre-trained ImageNet [54] weights and frozen, and the remaining layers were trained for 100 epochs with a learning rate of 0.001 (using the corresponding RGB or mask video data). In the second step, we load the best weights from the first phase, unfreeze the feature extractor, and train for another 100 epochs with a learning rate of 0.0001. In these two training steps, the mini-batch size was selected differently for experiments with different image dimensions due to memory constraints. Once the two streams are trained, all the parameters of the proposed architecture except the feature extractors are trained on the dataset in the final stage. However, for the video dimension 64 × 64, it was possible to train all the parameters in the final stage.
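A minimal PyTorch sketch of how the stream-wise gates, feature concatenation, bi-directional LSTM, and frame-wise attention described above could fit together is shown below. The layer sizes, the sigmoid parameterization of the stream gates, and the softmax-based frame attention are assumptions for illustration, not the exact released architecture.

```python
import torch
import torch.nn as nn

class TwoStreamHead(nn.Module):
    """Classifier head operating on per-frame features from the RGB and mask streams."""

    def __init__(self, feat_dim: int = 2048, hidden: int = 256, num_classes: int = 18):
        super().__init__()
        # Stream-wise attention: one learnable scalar per stream, squashed to [0, 1].
        self.eta_rgb = nn.Parameter(torch.zeros(1))
        self.eta_mask = nn.Parameter(torch.zeros(1))
        self.bilstm = nn.LSTM(2 * feat_dim, hidden, num_layers=1,
                              batch_first=True, bidirectional=True)
        # Frame-wise attention: a score per time step, normalized over frames.
        self.frame_score = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, f_rgb, f_mask):
        # f_rgb, f_mask: (batch, k_frames, feat_dim) frame features from ResNeXt-50.
        gated = torch.cat([torch.sigmoid(self.eta_rgb) * f_rgb,
                           torch.sigmoid(self.eta_mask) * f_mask], dim=-1)
        h, _ = self.bilstm(gated)                          # (batch, k, 2 * hidden)
        alpha = torch.softmax(self.frame_score(h), dim=1)  # frame-wise weights
        pooled = (alpha * h).sum(dim=1)                    # weighted temporal pooling
        return self.classifier(pooled)

# Example with k = 40 sub-sampled frames and 2048-d ResNeXt-50 features.
logits = TwoStreamHead()(torch.randn(2, 40, 2048), torch.randn(2, 40, 2048))
print(logits.shape)  # torch.Size([2, 18])
```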
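The filter inflation used by the I3D baseline (repeating each pre-trained 2D kernel N times along the time axis and dividing by N) can be expressed in a few lines. This is a generic illustration of the technique, not the baseline's actual initialization code.

```python
import torch

def inflate_conv_weight(w2d: torch.Tensor) -> torch.Tensor:
    """Inflate a 2D conv weight (out, in, N, N) to a 3D one (out, in, N, N, N).

    The kernel is repeated N times along the new temporal dimension and divided
    by N, so summing over time recovers the original 2D filter response.
    """
    n = w2d.shape[-1]
    return w2d.unsqueeze(2).repeat(1, 1, n, 1, 1) / n

w2d = torch.randn(64, 3, 7, 7)  # e.g., a 7x7 Inception-V1 stem filter
w3d = inflate_conv_weight(w2d)
assert w3d.shape == (64, 3, 7, 7, 7)
assert torch.allclose(w3d.sum(dim=2), w2d)
```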
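The two-step training of each stream's feature extractor (frozen backbone first, then fine-tuning at a lower learning rate) could be sketched as follows. The `build_stream`, `train_phase`, and `rgb_loader` names are placeholders, the loop is reduced to the essentials, and the example assumes a recent PyTorch/torchvision where `torch.optim.RAdam` and the string `weights` argument are available.

```python
import torch.nn as nn
from torch.optim import RAdam  # the paper cites the RAdam optimizer [53]
from torchvision.models import resnext50_32x4d

def build_stream(num_classes: int = 18) -> nn.Module:
    """One stream: ImageNet-pretrained ResNeXt-50 with a new classification head."""
    model = resnext50_32x4d(weights="IMAGENET1K_V1")
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def train_phase(model, loader, lr, freeze_backbone, epochs):
    # Phase 1: train only the new head on top of the frozen backbone.
    # Phase 2: unfreeze everything and continue with a smaller learning rate.
    for name, p in model.named_parameters():
        p.requires_grad = (not freeze_backbone) or name.startswith("fc")
    opt = RAdam([p for p in model.parameters() if p.requires_grad], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for frames, labels in loader:  # loader yields (B, 3, H, W) frames and labels
            opt.zero_grad()
            loss = loss_fn(model(frames), labels)
            loss.backward()
            opt.step()

# Hypothetical usage with a DataLoader over sub-sampled RGB (or mask) frames:
# stream = build_stream()
# train_phase(stream, rgb_loader, lr=1e-3, freeze_backbone=True, epochs=100)
# train_phase(stream, rgb_loader, lr=1e-4, freeze_backbone=False, epochs=100)
```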
The experiments were run on an NVIDIA Titan Xp graphics processing unit (GPU). The code for the proposed system is available in our GitHub repository.

First, we compared the overall performance of the proposed two-stream method with the single-stream RGB model [24]. In this experiment, we used three different video frame sizes, 224 × 224, 112 × 112, and 64 × 64, and report the performance metrics averaged over the three resolutions. The averaged results are presented in Table II. These results show that, compared to the single-stream baseline, the proposed method provides absolute gains of 14.366% and 10.324% in averaged F1 score and accuracy, respectively. This demonstrates the effectiveness of the segmentation mask stream, which provides additional information to the activity classifier.

To examine the effect of input video resolution, experiments were performed at three different video frame resolutions: 64 × 64, 112 × 112, and 224 × 224. The averaged class-wise F1 scores obtained at each of these video resolutions are presented in Fig. 5. Here, we observe that the performance improvement achieved by the proposed method is more significant for the lower-resolution images of dimension 64 × 64 and 112 × 112. The absolute improvements in F1 score achieved by the proposed system are 1%, 17% and 26% for video resolutions of 224 × 224, 112 × 112 and 64 × 64, respectively. We performed McNemar's statistical significance test [55] for these three experiments and found that the improvements are significant (p < 0.01) for the dimensions 64 × 64 and 112 × 112. We believe this to be a noteworthy achievement of the two-stream method.

To gain a deeper understanding of why our method is able to provide such significant improvements at lower resolutions, we identified three difficult activity classes, Clean, Chat and Drink, based on F1 scores. The class-wise F1 scores for these activities are summarized in Fig. 6 for all three video resolutions obtained from the two-stream and single-stream methods. From the results, we observe that the single-stream method completely fails to identify the classes Clean and Chat (F1 scores below 10%) in the 64 × 64 resolution videos, whereas the proposed method is still able to identify these activities based on the color-coded masks. From the figure, it is also evident that the performance of the activity classifier increases as the frame resolution increases.

To evaluate the impact of the stream-wise attention module, we performed experiments on the proposed system with and without this module. The results are summarized in Table III for input video frame dimensions of 224 × 224. Here, we observe that using the stream-wise attention provides absolute improvements in mean F1 score and accuracy of 2.14% and 0.98%, respectively. The average stream-wise attention values for different classes, which demonstrate the impact of the stream gate on predictions, are shown in Fig. 7.

Fig. 7: Average stream-wise attention weights for different classes obtained from the RGB stream (blue) and the segmentation mask stream (green) in the proposed architecture. We note that some classes (e.g., chat, shake) do not depend on the mask stream, while other classes (e.g., mobile, paper) depend more on the color-coded masked frames compared to the raw RGB frames.
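For reference, McNemar's test used above compares two classifiers only on the samples where they disagree. The paper does not state which variant was used, so the sketch below uses one common formulation, the chi-square form with continuity correction; the function name and toy data are for illustration only.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_p(y_true, pred_a, pred_b) -> float:
    """McNemar's test (chi-square with continuity correction) for paired classifiers.

    b = samples correct under A but wrong under B; c = the reverse.
    """
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_ok, b_ok = pred_a == y_true, pred_b == y_true
    b = int(np.sum(a_ok & ~b_ok))
    c = int(np.sum(~a_ok & b_ok))
    if b + c == 0:
        return 1.0  # the classifiers never disagree
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return float(chi2.sf(stat, df=1))

# Toy example with hypothetical per-sample predictions from two models.
print(mcnemar_p([0, 1, 1, 0, 1], [0, 1, 1, 0, 1], [1, 0, 1, 0, 1]))
```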
To study the effectiveness of the frame-wise attention module, we ran a set of experiments including and excluding this module while keeping the rest of the architecture the same. The results of these experiments, presented in Table IV, show that the frame-wise attention provides absolute gains in mean F1 score and accuracy of 1.34% and 1.18%, respectively. As an illustrative example, the frame-wise attention values obtained from a video sample containing the activity class chat are shown in Fig. 8. We observe from this figure that the attention value is higher when the person is looking toward the camera and is more likely to be speaking to the subject (the FPV camera bearer).

Fig. 8: Variation of the frame-wise attention values for a series of sub-sampled frames obtained from a video segment containing the chat activity class. Both the RGB and the mask frames are shown for illustration. In the mask frames, the relevant objects person, water flask and computer screen are detected and color-coded with white, blue and red, respectively. The frame-wise attention values increase when the person is looking towards the camera for talking.

In the final evaluation, we compare the proposed system with the state-of-the-art I3D model and the single-stream baseline for the activity classification task. In this experiment, we fixed the video frame size to 224 × 224 for all methods. From the results presented in Table V, we observe that the I3D method achieves an accuracy of 88.380% and an F1 score of 85.647%, while the proposed two-stream method reaches an accuracy of 90.799% and an F1 score of 90.176%. Thus, the proposed model improves the performance on this task by an absolute margin of 4.529% in averaged class-wise F1 score.

The experimental results have demonstrated that the proposed two-stream method performs better than both a baseline single-stream method and the state-of-the-art I3D model. The results of the proposed method are significantly better for the lower-resolution videos. In this section, we point out a few limitations of our experiments. Firstly, we used the ResNeXt-50 feature extractor, which is neither the largest nor the deepest feature extractor for this kind of task. We believe that ResNet-152 [27], DenseNet-201 [30], or Wide ResNet-101 [28] may offer further improved performance. However, due to computational limitations, and for the sake of experimental comparison between the two-stream and single-stream methods, we chose to use the smaller ResNeXt-50 network. Secondly, we note that the two-stream method's performance on the high-resolution videos (224 × 224) is sub-optimal. The reason is that, for the smaller frame sizes (64 × 64 and 112 × 112), we were able to use a large mini-batch size and were also able to train the feature extractor in the final phase of training. However, due to computational limitations, a similar training scheme was not possible for the two-stream method at the 224 × 224 resolution. We also believe that our results would improve further if we could annotate our video frames with the class-relevant object masks. In this way, our segmentation module could be fine-tuned on the FPV-O data, providing improved color-coded masks. Since we had to use a pre-trained Mask R-CNN model for our segmentation module, we were not able to evaluate its performance on the FPV-O dataset.

In this study, we have developed SegCodeNet, a two-stream network that uses a color-coded semantic segmentation mask-based video stream in addition to the conventional RGB video stream for activity classification from wearable cameras. A pre-trained Mask R-CNN model was used to generate specific colored masks for the important objects of each action class.
The feature vector from this mask stream was concatenated with the feature vector of the RGB stream and provided to a bi-directional LSTM for activity classification. Our system provided superior performance compared to two state-of-the-art baseline models, including a single-stream method and a hybrid optical flow-based model. The proposed method performs significantly better than the single-stream method for lower-resolution videos. We have also included stream-wise and frame-wise attention modules that further improve the performance.

References
[1] Quo vadis, action recognition? A new model and the Kinetics dataset.
[2] Compact CNN for indexing egocentric videos.
[3] Visual features for ego-centric activity recognition: A survey.
[4] Different approaches for human activity recognition: A survey.
[5] Object interaction detection using hand posture cues in an office setting.
[6] UCF101: A dataset of 101 human actions classes from videos in the wild.
[7] HMDB: A large video database for human motion recognition.
[8] The THUMOS challenge on action recognition for videos in the wild.
[9] The Kinetics human action video dataset.
[10] A short note about Kinetics-600.
[11] A short note on the Kinetics-700 human action dataset.
[12] Wearable hand activity recognition for event summarization.
[13] Temporal segmentation and activity classification from first-person sensing.
[14] Understanding egocentric activities.
[15] Novelty detection from an ego-centric perspective.
[16] Temporal segmentation of egocentric videos.
[17] Robust multi-dimensional motion features for first-person vision activity recognition.
[18] Hierarchical modeling for first-person vision activity recognition.
[19] Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice.
[20] Action recognition with improved trajectories.
[21] Action and event recognition with Fisher vectors on a compact feature set.
[22] Action recognition with stacked Fisher vectors.
[23] A review of human activity recognition methods.
[24] Action recognition in video sequences using deep bi-directional LSTM with CNN features.
[25] Simple yet efficient real-time pose-based action recognition.
[26] ImageNet classification with deep convolutional neural networks.
[27] Deep residual learning for image recognition.
[28] Wide residual networks.
[29] Aggregated residual transformations for deep neural networks.
[30] Densely connected convolutional networks.
[31] Xception: Deep learning with depthwise separable convolutions.
[32] Bidirectional recurrent neural networks.
[33] OpenPose: Realtime multi-person 2D pose estimation using part affinity fields.
[34] Contextual action cues from camera sensor for multi-stream action recognition.
[35] ReHAR: Robust and efficient human activity recognition.
[36] Coupling eye-motion and ego-motion features for first-person activity recognition.
[37] RPAN: An end-to-end recurrent pose-attention network for action recognition in videos.
[38] Hand segmentation for gesture recognition in ego-vision.
[39] Determining optical flow.
[40] FlowNet 2.0: Evolution of optical flow estimation with deep networks.
[41] PoTion: Pose motion representation for action recognition.
[42] An efficient optical flow based motion detection method for non-stationary scenes.
[43] Mask R-CNN.
[44] IEEE Video and Image Processing Cup (VIP Cup).
[45] A first-person vision dataset of office activities.
[46] Microsoft COCO: Common objects in context.
[47] Inception-v4, Inception-ResNet and the impact of residual connections on learning.
[48] The elephant in the room.
[49] TV-L1 optical flow estimation.
[50] 3D convolutional neural networks for human action recognition.
[51] Batch normalization: Accelerating deep network training by reducing internal covariate shift.
[52] One weird trick for parallelizing convolutional neural networks.
[53] On the variance of the adaptive learning rate and beyond.
[54] ImageNet: A large-scale hierarchical image database.
[55] Approximate statistical tests for comparing supervised classification learning algorithms.