authors: Fugošić, Kristijan; Šarić, Josip; Šegvić, Siniša
title: Multimodal Semantic Forecasting Based on Conditional Generation of Future Features
date: 2021-03-17
journal: Pattern Recognition
DOI: 10.1007/978-3-030-71278-5_34

This paper considers semantic forecasting in road-driving scenes. Most existing approaches address this problem as deterministic regression of future features or future predictions given observed frames. However, such approaches ignore the fact that the future cannot always be guessed with certainty. For example, when a car is about to turn around a corner, the road which is currently occluded by buildings may turn out to be either free to drive, or occupied by people, other vehicles or roadworks. When a deterministic model confronts such a situation, its best guess is to forecast the most likely outcome. However, this is not acceptable, since it defeats the purpose of forecasting as a means to improve safety. It also throws away valuable training data, since a deterministic model is unable to learn any deviation from the norm. We address this problem by giving the model more freedom, allowing it to forecast different futures. We propose to formulate multimodal forecasting as sampling of a multimodal generative model conditioned on the observed frames. Experiments on the Cityscapes dataset reveal that our multimodal model outperforms its deterministic counterpart in short-term forecasting while performing slightly worse in the mid-term case.

ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this chapter (10.1007/978-3-030-71278-5_34) contains supplementary material, which is available to authorized users.

Self-driving cars are today's burning topic [27]. With their arrival, the way that we look at passenger and freight traffic will change forever. But in order to solve such a complex task, we must first solve a series of "simpler" problems. One of the most important elements of an autonomous driving system is the ability to recognize and understand the environment [2, 7]. It is very important that the system is able to recognize roads, pedestrians moving along or on the pavement, other cars and all other traffic participants. This makes semantic segmentation a very popular problem [26, 30, 31]. However, the ability to predict the future is an even more important attribute of intelligent behavior [16, 21, 23-25]. It is intuitively clear that critical real-time systems such as autonomous driving controllers could immensely benefit from the ability to predict the future by considering the past [4, 17, 27]. Such systems could make much better decisions than their counterparts which are able to perceive only the current moment. Unfortunately, this turns out to be a very hard problem. Most of the current work in the field approaches it very conservatively, by forecasting only a unimodal future [16, 22]. However, this approach makes the unrealistic assumption that the future is completely determined by the past, which makes it suitable for guessing only the short-term future. Hence, deterministic forecasting approaches will be prone to allocating most of their forecasts to instances of common large classes such as cars, roads or sky. On the other hand, such approaches will often underrepresent smaller objects.
When it comes to signs, poles, pedestrians or other thin objects, it makes more sense for a conservative model to allocate more space to the background than to risk predicting them. Additionally, future locations of dynamic and articulated objects such as pedestrians or domestic animals would also be very hard to forecast with a deterministic approach. In order to address the problems of unimodal forecasting, this work explores how to equip a given forecasting model with somewhat more freedom, by allowing and encouraging the prediction of different futures. Another motivation for doing so involves scenarios in which previously unseen space becomes disoccluded. Such scenarios can happen when we are turning around a corner or when another car or some larger vehicle is passing by. Sometimes we can deduce what could be in that new space by observing the recent past, and sometimes we simply cannot know. In both cases, we would like our model to produce a distribution over all possible outcomes in a stochastic environment [1, 4, 9, 17]. We address this goal by converting the basic regression model into a conditional generative model based on adversarial learning [18] and moment reconstruction losses [12].

Dense Semantic Forecasting. Predicting future scene semantics is a prominent way to improve the accuracy and reaction speed of autonomous driving systems. Recent work shows that direct semantic forecasting is more effective than RGB forecasting [16]. Further work proposes to forecast features from an FPN pyramid with multiple feature-to-feature (F2F) models [15]. This has recently been improved by single-level F2F forecasting with deformable convolutions [19, 20].

Multimodal Forecasting. The future is uncertain and multimodal, especially in long-term forecasting. Hence, forecasting multiple futures is an interesting research goal. An interesting related work forecasts multi-modal pedestrian trajectories [9]. Similar to our work, they also achieve multimodality through a conditional GAN framework. Multi-modality has also been expressed through mixture density networks [17] for egocentric localization and emergence prediction. Neither of these two works considers semantic forecasting. To the best of our knowledge, there are only a few works on multimodal semantic forecasting, and all of them are either very recent [1] or concurrent [17]. One way to address multimodal semantic forecasting is to express inference within a Bayesian framework [1]. However, Bayesian methods are known for slow inference and poor real-time performance. Multi-modality can also be expressed within a conditional variational framework [4], by modelling the interaction between the static scene and moving objects, as well as between multiple moving objects. However, the reported performance suggests that the task is far from being solved.

GANs with Moment Reconstruction. GANs [3] and their conditional versions [18] have been used in many tasks [6, 11, 14, 28]. However, these approaches lack output diversity due to mode collapse. Recent work alleviates this problem with a moment reconstruction loss [12], which also improves training stability.

Improving Semantic Segmentation with Adversarial Loss. While most GAN discriminators operate on the raw image level, they can also be applied to probabilistic maps. This can be used either as a standalone loss [8] or as a regularizer of the standard cross-entropy loss [5].

Generative adversarial models [3] consist of two neural networks: a generator and a discriminator.
Each of them has its own task and a separate loss function. The goal of the generator is to produce diverse and realistic samples, while the discriminator classifies a given sample as either real (drawn from the dataset) or fake (generated). By conditioning the model it is possible to direct the data generation process. Generative adversarial networks can be extended to a conditional model if both the generator and the discriminator are conditioned on some additional information [18]. The additional information can be of any kind; in our case it is a blend of features extracted from past frames. However, both the standard GAN and its conditional version are highly unstable to train. To counter the instability, most conditional GANs for image-to-image translation [6] use a reconstruction (L1/L2) loss in addition to the GAN loss. While the reconstruction loss forces the model to generate samples similar to the ground truth, it often results in mode collapse. Mode collapse is one of the greatest problems of generative adversarial models: while we desire diverse outputs, mode collapse manifests itself as a one-to-one mapping. This problem can be mitigated by replacing the traditional reconstruction loss with moment reconstruction (MR) losses, which increase training stability and favour multimodal output generation [12].

The main idea of MR-GAN [12] is to use a maximum likelihood estimation loss to predict conditional statistics of the real data distribution. Specifically, MR-GAN estimates the central measure and the dispersion of the underlying distribution, which correspond to the mean and the variance in the Gaussian case. The overall architecture of MR-GAN is similar to conditional GANs, with two important novelties: 1. the generator produces K different samples $\hat{y}_{1:K}$ for each image x by varying the random noise $z_{1:K}$; 2. the loss function is applied to the sampled moments (mean and variance), in contrast to the reconstruction loss, which is applied directly to the samples. They estimate the moments of the generated distribution as follows: $\hat{\mu} = \frac{1}{K} \sum_{k=1}^{K} \hat{y}_k$ and $\hat{\sigma}^2 = \frac{1}{K} \sum_{k=1}^{K} (\hat{y}_k - \hat{\mu})^2$. The MR loss is calculated by plugging $\hat{\mu}$ and $\hat{\sigma}^2$ into the Gaussian negative log-likelihood (Eq. 1). The loss thus obtained is called MR2, while the loss that does not take the variance into account is denoted MR1. For more stable learning, especially at an early stage, the authors suggest the Proxy Moment Reconstruction (proxy MR) loss. As it was shown in [12] that the MR and proxy MR losses achieve similar results on the Pix2Pix [6] problem, we use the simpler MR losses, which allow easier end-to-end training.
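To make the moment reconstruction losses concrete, here is a minimal PyTorch-style sketch under our own assumptions about tensor shapes; the function name and the epsilon constant are illustrative, not taken from [12]:

```python
import torch

def moment_reconstruction_losses(samples, target, eps=1e-6):
    """samples: (K, ...) tensor with K generated forecasts for one input,
    target: (...) tensor with the corresponding ground truth.

    MR1 matches only the sample mean; MR2 is the Gaussian negative
    log-likelihood that also takes the sample variance into account.
    """
    mu = samples.mean(dim=0)                    # estimated conditional mean
    var = samples.var(dim=0, unbiased=False)    # estimated conditional variance
    mr1 = ((target - mu) ** 2).mean()
    mr2 = ((target - mu) ** 2 / (2 * var + eps) + 0.5 * torch.log(var + eps)).mean()
    return mr1, mr2
```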
Most of the previous work on forecasting focuses on predicting raw RGB future frames and their subsequent semantic segmentation. Success in that area would be a significant achievement because it would make it possible to train on extremely large sets of unlabeled data. However, problems such as autonomous driving require the system to recognize the environment on a semantically meaningful level. In that sense, forecasting at the RGB level is an unnecessary complication. While many attempts at feature-to-feature forecasting were based on semantic segmentation, [15] goes a step further and predicts the semantic future at the instance level. This step facilitates understanding and prediction of individual object trajectories. The proposed model shares much of its architecture with Mask R-CNN, with the addition of a future prediction module. Since the number of objects in the images varies, they do not predict the labels of the objects directly. Instead, they predict convolutional features of fixed dimensions. Those features are then passed through the detection head and the upsampling path to obtain the final predictions. Šarić et al. [19] propose a single-level F2F model with deformable convolutions. The proposed model, denoted DeformF2F, brings a few notable changes compared to [15]: 1. a single-level F2F model which operates on the last, spatially smallest resolution; 2. deformable convolutions instead of classic or dilated ones; 3. the ability to fine-tune two separately trained submodels (the F2F module and the semantic segmentation submodel). DeformF2F achieves state-of-the-art performance on mid-term (t+9) prediction, and the second-best result on short-term (t+3) prediction.

We use a modified single-level F2F forecasting model as the generator and a customized PatchGAN [13] as the discriminator, while the MR1 and MR2 losses are used in order to achieve diversity in the predictions. We denote our model as MM-DeformF2F (Multimodal DeformF2F, Fig. 1).

Generator. The generator is based on the DeformF2F model. In order to generate diverse predictions, we introduce noise in each forward pass. The Gaussian noise tensor has 32 channels and matches the spatial dimensions of the input tensor. Instead of one prediction, we now generate K different predictions using K different noise tensors. The generator is trained with the MR and GAN losses applied to those predictions.

Discriminator. Following the proposal from [12], we use PatchGAN as the discriminator. Since the input features of our PatchGAN have significantly smaller spatial dimensions, we use it in a modified form with fewer convolutional layers. Its purpose is still to reduce the features to smaller regions, and then to judge each region as either fake (generated) or real (from the dataset). Decisions across all patches are averaged in order to produce the final judgment in the form of a scalar between 0 and 1. Since the discriminator was too dominant during learning, we introduce dropout in its first convolutional layer. In general, we drop between 50 and 65% of the features.

Following the example of [19], we use video sequences from the Cityscapes dataset. The dataset contains 2975 scenes (video sequences) for training, 500 for validation and 1525 for testing, with labels for 19 classes. Each scene consists of 30 frames with a total duration of 1.8 s. This means that the dataset contains a total of 150,000 images at a resolution of 1024 × 2048 pixels. Ground-truth semantic segmentation is available for the 20th frame of each scene. Since the introduction of GAN methods made our model more complicated, all images from the dataset were halved in width and height in order to reduce the number of features and speed up training.

Training Procedure. In the previous paragraph we described the dataset and the initial processing of the input data. If we denote the current moment as t, then in short-term prediction we use convolutional features at moments t−9, t−6 and t−3 in order to predict the semantic segmentation at moment t+3, or at moment t+9 for mid-term forecasting. The features have spatial dimensions 16 × 32 and 128 channels. Training can be divided into two parts. First, we jointly train the feature extractor and the upsampling branch with the cross-entropy loss [10, 19]. All images later used for training are passed through the feature extraction branch and the resulting features are stored on an SSD drive. We later load these cached features instead of running the feature extractor, which saves time in successive training and evaluation of the model. In the second part, we train the F2F model in an unsupervised manner.
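For intuition, here is a minimal sketch of how the noise-conditioned generator could be organized; the module name, layer widths and the concatenation point are our own assumptions, and a plain convolutional stack stands in for the deformable-convolution forecasting module:

```python
import torch
import torch.nn as nn

class NoisyF2F(nn.Module):
    """Toy noise-conditioned feature-to-feature forecaster (illustrative only)."""

    def __init__(self, feat_ch=128, noise_ch=32, past_frames=3):
        super().__init__()
        self.noise_ch = noise_ch
        in_ch = feat_ch * past_frames + noise_ch
        # The real model uses deformable convolutions here; a plain conv
        # stack keeps the sketch simple.
        self.f2f = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, feat_ch, 3, padding=1),
        )

    def forward(self, past_feats, k=8):
        # past_feats: (B, past_frames * feat_ch, H, W), e.g. H = 16, W = 32
        b, _, h, w = past_feats.shape
        forecasts = []
        for _ in range(k):
            z = torch.randn(b, self.noise_ch, h, w, device=past_feats.device)
            forecasts.append(self.f2f(torch.cat([past_feats, z], dim=1)))
        return torch.stack(forecasts)  # (k, B, feat_ch, H, W)
```

Each of the k forecast feature tensors would then be decoded by the pretrained upsampling path into a dense semantic segmentation.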
Unlike [19], instead of the L2 loss we use the MR loss and the GAN loss. We give a somewhat greater influence to the reconstruction loss (λ_MR = 100) than to the adversarial loss (λ_GAN = 10). For both the generator and the discriminator, we use the Adam optimizer with a learning rate of 4·10⁻⁴ and decay rates of 0.9 and 0.99 for the first and second moment estimates, respectively. We reduce the learning rate using cosine annealing without restart to a minimum value of 1·10⁻⁷. To balance the generator and the discriminator, we introduce dropout in the first convolutional layer of the discriminator. As an example, training the short-term forecasting task with the MR1 loss but without dropout begins to stagnate as early as the fortieth epoch, with an mIoU 1.5 to 2 percentage points below the best achieved results.

We report average metrics across 3 trained models and multiple evaluations for each task. In every forward pass we generate 8 predictions. We use mIoU as our main accuracy metric, while MSE and LPIPS are used to express diversity as explained below. Mean squared error (MSE) is our main diversity metric. We measure the pixel-level Euclidean distance between every two generated predictions for each scene, and take the mean over the whole dataset. Following the example of [12], we also use LPIPS (Learned Perceptual Image Patch Similarity [29]) to quantify the diversity of the generated images. In [29] it was shown that deep features can be used to describe the similarity between two images while outperforming traditional measures such as L2 or SSIM. We measure LPIPS between every two generated predictions for each scene, and take the mean over the whole dataset. Since we do not generate RGB images, but instead predict the semantic future, which has limited structure, MSE has proved to be a sufficient measure of diversity. In addition to the numerical results, in the following subsections we also show generated predictions, as well as two grayscale maps: a) the mean logit variance (logits.var(dim=0).mean(dim=0)), and b) the variance of the discrete predictions. An example of these maps is shown in Fig. 2. The first map highlights areas of uncertainty, while on the second map we observe areas that are classified into different classes in different generated samples.

Fig. 2. Shown in the following order: the future frame and its ground-truth segmentation, the mean logit variance, and the variance of the discrete predictions. The first gray map highlights areas of uncertainty, while on the second gray map we see areas that are classified into different classes in different generated samples. The higher the uncertainty, the whiter the area.

We conducted experiments on the short-term and mid-term forecasting tasks with roughly the same hyperparameters. With the MR1 loss we observe an mIoU which is on par with the baseline model, and slightly higher in the case of short-term forecasting. On the other hand, using the MR2 loss results in a lower mIoU, but the predictions are considerably more diverse than with MR1. Although we could obtain a higher mIoU on mid-term forecasting with minimal hyperparameter changes at the cost of diversity, we do not intervene, because mIoU is not the only relevant measure for this task. Accordingly, although MR2 lowers mIoU by 5 or more percentage points, we still use it because of the greater variety.
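As a concrete illustration of the two uncertainty maps described above, here is a minimal sketch assuming the K forecast logits are stacked into one tensor; the second map is approximated as the per-pixel disagreement with the modal class, since its exact definition is implementation-specific:

```python
import torch

def uncertainty_maps(logits):
    """logits: (K, C, H, W) tensor with logits of K generated forecasts.

    Returns two (H, W) grayscale maps: the mean logit variance across
    samples, and the per-pixel disagreement of the discrete predictions.
    """
    mean_logit_var = logits.var(dim=0).mean(dim=0)         # as quoted in the text
    labels = logits.argmax(dim=1)                          # (K, H, W) discrete predictions
    modal = labels.mode(dim=0).values                      # most frequent class per pixel
    disagreement = (labels != modal).float().mean(dim=0)   # fraction of differing samples
    return mean_logit_var, disagreement
```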
Visually, the most interesting predictions were obtained by using MR2 on the short-term forecasting task. One of them is shown in Fig. 3. Notice that the people are visible in the first frame and obscured by the car in the last input frame. In the future moment, the car reveals the space behind it, and for the first time our model predicts people in the correct place (second row, first prediction). We failed to achieve anything similar with the MR1 loss or with the baseline model. Such predictions are possible with the MR2 loss, but still rare; in this particular case the model recognized the people in the right place in only one of the twelve predictions.

Fig. 3. Short-term forecasting with the MR2 loss. Row 1 shows the first and the last input frame, the future frame, and its ground-truth segmentation. Row 2 shows 4 out of 12 model predictions.

Our main results are shown in Tables 1 and 2. Since the results in [19] were obtained on full-resolution images, we retrained their model on images with halved height and width. In the tables we show the average mIoU and mIoU-MO (moving objects) across five different models. We also show the results achieved with the oracle, the single-frame model used to train the feature extractor and the upsampling path, which "predicts" the future segmentation by observing the future frame. While the oracle represents an upper limit, the copy-last-segmentation baseline can be seen as a lower bound, or as a good measure of the difficulty of this task. We get a slightly better mIoU if we average the predictions, although this contradicts the original idea of this paper. As in [1], we also observe a slight increase in mIoU when comparing the top 5% of predictions to the averaged predictions. The performance boost is best seen in the moving-object accuracy (mIoU-MO) with the MR1 loss, as we show in more detail in the supplementary material.

Table 3 shows the impact of the number of generated predictions (K) on their diversity and the measured mIoU. We can see that a larger number of generated predictions contributes to greater diversity, while slightly reducing mIoU. In training, we use K = 8 because of the acceptable training time and satisfactory diversity; training with K = 16 would take about twice as long. We discuss the memory overhead and evaluation time in the supplementary material.

Table 3. Impact of the number of generated samples (K) on mIoU and diversity, measured on the mid-term prediction task with the MR1 loss. A larger number of generated samples contributes to greater diversity while slightly reducing mIoU. Testing was performed at an early stage of this work, so the results differ somewhat from those in Table 2.

To show that the output diversity is not due solely to the moment reconstruction losses and random noise, we trained the model without the adversarial loss (λ_GAN = 0). In this experiment we used the MR1 loss on the short-term forecasting task, with the generator producing 8 predictions for each input image. Although some diversity is visible at an early stage (MSE around 0.7), around the 12th epoch the diversity becomes less and less noticeable (MSE around 0.35), and after the 40th epoch it can barely be seen (MSE around 0.1). The model achieved its best mIoU of 59.18 in epoch 160 (although it was trained for 400 epochs), and the MSE in that epoch was 0.08. Figure 4a shows the improvement in performance over the epochs, but with a gradual weakening of diversity. For this example we chose an image with a lot of void surfaces, because the greatest diversity is usually seen in those areas. In Fig. 4b we show the same scene, but with predictions obtained from a model trained with weights λ_GAN = 10 and λ_MR = 100. It took this model 232 epochs to achieve its best mIoU of 59.14, but the MSE held stable above 1 until the last (400th) epoch.
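For reference, the pairwise MSE diversity used above could be computed along the following lines; this is a sketch under our own assumptions about tensor shapes, not the authors' evaluation code:

```python
import torch

def pairwise_mse_diversity(preds):
    """preds: (K, C, H, W) tensor with K forecasts for a single scene.

    Returns the mean squared difference over all pairs of forecasts;
    averaging this value over all scenes gives the dataset-level MSE diversity.
    """
    k = preds.size(0)
    flat = preds.reshape(k, -1)
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += ((flat[i] - flat[j]) ** 2).mean().item()
            pairs += 1
    return total / pairs
```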
We have seen that averaging the generated predictions before evaluating them increases mIoU, from 0.2 up to 3 percentage points, depending on the task. Therefore, we propose a novel metric for measuring the plausibility of multimodal forecasting. The proposed metric measures the percentage of pixels that were correctly classified at least once across multiple forecasts. We distinguish three cases by looking at: 1. every pixel except the void class; 2. only pixels of movable objects; 3. only pixels that were correctly classified by the oracle. We measure at multiple checkpoints (1, 2, 4, ..., 128) and present the obtained results in Fig. 5. In Fig. 5 we compare short-term forecasting using the MR1 and MR2 losses, while in the supplementary material we also show an additional line which represents the best prediction so far.

Fig. 5. Number of future pixels that were correctly classified at least once, depending on the number of forecasts. We show three different cases with lines of different colors, as described by the legend. Solid lines represent the MR2 short-term model, while dashed lines represent results with the MR1 short-term model.

We have presented a novel approach for multimodal semantic forecasting in road-driving scenarios. Our approach achieves multi-modality by injecting random noise into the feature forecasting module. Hence, the feature forecasting module becomes a conditional feature-to-feature generator which is trained by minimizing the moment reconstruction loss and by maximizing the loss of a patch-level discriminator. Both the generator and the discriminator operate on abstract features. We have also proposed a novel metric for measuring the plausibility of multimodal forecasting. The proposed metric measures the number of forecasts required to correctly guess a given proportion of all future pixels. We encourage the metric to reflect forecasting performance by disregarding pixels which are not correctly guessed by the oracle. The inference speed of our multi-modal model is similar to that of the uni-modal baseline. Experiments show that the proposed setup is able to achieve considerable diversity in mid-term forecasting. The MR2 loss brings more diversity than MR1; however, it reduces mIoU by around 5 percentage points. Inspection of the generated forecasts reveals that the model is sometimes still hesitant to replace close and large objects, but it often takes a chance on close and dynamic or small and distant objects, such as bikes and pedestrians. In future work we shall consider training with the proxy MR1 and proxy MR2 losses. We shall also consider using a different discriminator, for example one with global contextual information. Another option is to concatenate the features with their spatial pools before the discriminator. Other suitable future directions include evaluating performance on the instance segmentation task and experimenting with different generative models.
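As a concrete illustration of the proposed plausibility metric, here is a minimal sketch assuming integer label maps for the forecasts, the ground truth and the oracle prediction; it covers the third case from above (oracle-correct pixels), and the other two cases only change the validity mask. Function and variable names are ours:

```python
import numpy as np

def correct_at_least_once(forecasts, gt, oracle, ignore_id=255):
    """forecasts: (K, H, W) array of predicted labels from K forecasts,
    gt: (H, W) ground-truth labels, oracle: (H, W) oracle predictions.

    Returns the fraction of evaluated pixels that at least one forecast
    classified correctly. Void pixels and pixels the oracle gets wrong are
    disregarded, so the value reflects forecasting performance only.
    """
    valid = (gt != ignore_id) & (oracle == gt)
    hit = (forecasts == gt[None]).any(axis=0)
    return (hit & valid).sum() / valid.sum()
```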
[1] Bayesian prediction of future street scenes using synthetic likelihoods
[2] Leveraging semi-supervised learning in video sequences for urban scene segmentation
[3] Generative adversarial networks
[4] Probabilistic future prediction for video scene understanding
[5] Adversarial learning for semi-supervised semantic segmentation
[6] Image-to-image translation with conditional adversarial networks
[7] Panoptic segmentation
[8] Adversarial networks for the detection of aggressive prostate cancer
[9] Social-BiGAT: multimodal trajectory forecasting using bicycle-GAN and graph attention networks
[10] Efficient ladder-style DenseNets for semantic segmentation of large images
[11] Photo-realistic single image super-resolution using a generative adversarial network
[12] Harmonizing maximum likelihood with GANs for multimodal conditional generation
[13] Precomputed real-time texture synthesis with Markovian generative adversarial networks
[14] Dual motion GAN for future-flow embedded video prediction
[15] Predicting future instance segmentation by forecasting convolutional features
[16] Predicting deeper into the future of semantic segmentation
[17] Multimodal future localization and emergence prediction for objects in egocentric view with a reachability prior
[18] Conditional generative adversarial nets
[19] Single level feature-to-feature forecasting with deformable convolutions
[20] Warp to the future: joint forecasting of features and feature motion
[21] Predicting behaviors of basketball players from first person videos
[22] Predicting future instance segmentation with contextual pyramid ConvLSTMs
[23] Recurrent flow-guided semantic forecasting
[24] Anticipating the future by watching unlabeled video
[25] One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network
[26] DenseASPP for semantic segmentation in street scenes
[27] Egocentric vision-based future vehicle localization for intelligent driving assistance systems
[28] Generative image inpainting with contextual attention
[29] The unreasonable effectiveness of deep features as a perceptual metric
[30] Pyramid scene parsing network
[31] Learning fully dense neural networks for image semantic segmentation