title: View Birdification in the Crowd: Ground-Plane Localization from Perceived Movements
authors: Nishimura, Mai; Nobuhara, Shohei; Nishino, Ko
date: 2021-11-09

Abstract. We introduce view birdification, the problem of recovering the ground-plane movements of people in a crowd from an ego-centric video captured by an observer (e.g., a person or a vehicle) that is also moving in the crowd. Recovered ground-plane movements would provide a sound basis for situational understanding and benefit downstream applications in computer vision and robotics. In this paper, we formulate view birdification as a geometric trajectory reconstruction problem and derive a cascaded optimization method from a Bayesian perspective. The method first estimates the observer's movement and then localizes the surrounding pedestrians for each frame while taking into account the local interactions between them. We introduce three datasets by leveraging synthetic and real trajectories of people in crowds and evaluate the effectiveness of our method. The results demonstrate the accuracy of our method and set the ground for further studies of view birdification as an important but challenging visual understanding problem.

We, human beings, are capable of mentally visualizing our surroundings in a third-person view. Imagine walking down a street alongside other pedestrians. Your mental model of the movements of surrounding people is not a purely two-dimensional one, but rather a 3D one, albeit imperfect. It lets you guess your present location and how the geometric layout of your surroundings changes as you navigate, even in a dense crowd where everything around you is dynamic. Endowing computers with such 3D spatial perception remains elusive. Despite the significant progress in computational 3D and motion perception, structure from motion, and SLAM, reconstructing the 3D geometry and motion of an "everywhere-dynamic" scene is still challenging. Past works fundamentally rely on the visibility of a textured static background from which keypoints can be extracted at all times, so that the ego-motion can be estimated regardless of the surrounding movements.

Figure 1: View birdification aims to recover the ground-plane trajectories of people in a crowd from an ego-centric video captured by a dynamic observer without static references.

In this paper, we ask a fundamental question of 3D computational perception. Can we recover our own and surrounding movements on the ground plane from their perceived movements in the image plane of our view, when we cannot easily discern our ego-motion? That is, given 2D ego-centric views from an agent moving in a dynamic environment consisting of other moving agents, can we localize all agents on the ground plane without requiring that a static background is visible in the images? We refer to this problem as view birdification in a crowd: the problem of computing a bird's-eye view of the movements of surrounding people from a single dynamic ego-centric view (see Fig. 1). Note that our focus is on the movements, not the appearance, for which recent work has introduced various approaches. The need for view birdification frequently arises in a wide range of vision tasks, for instance, when a person is walking in a dense dynamic crowd where ego-centric views of the surroundings are limited, making static reference requirements unrealistic.
A robust solution to this key question would bring us a large step forward towards robust robot navigation and situational awareness in the wild, and would also expand the horizon of surveillance.

We introduce a purely geometric approach to view birdification. The method only requires 2D bounding boxes of people in the ego-view and therefore generalizes across appearances. Our method is based on two key insights. First, the movements of pedestrians are not arbitrary, but exhibit coordinated motion that can be expressed with crowd flow models [15, 33]. That is, the interactions of pedestrians' movements in a crowd can be locally described with analytic or data-driven models. Second, the projected height of a pedestrian in the image is inversely proportional to its geometric depth, up to the variation in true human heights [25]. In other words, the positions of pedestrians on the ground plane are constrained to lie along the lines that pass through the center of projection. These insights lend us a natural formulation of view birdification as a geometric reconstruction problem.

We formulate view birdification as a cascaded optimization problem, consisting of camera ego-motion estimation constrained by predicted pedestrian motion, and pedestrian localization given the ego-motion estimate. We solve this with a cascaded optimization consisting of gradient descent and combinatorial optimization under the projection constraints and the assumed interaction model.

We experimentally validate our method on synthetic ego-centric views of people walking on trajectories extracted from publicly available crowd datasets. Since our method is appearance-agnostic, these datasets faithfully reflect real-world conditions except for possible errors in bounding box extraction (i.e., multi-object tracking). To evaluate the end-to-end accuracy including tracking errors, we create a photorealistic crowd dataset that simulates real camera projection with a limited field of view and occluded pedestrian observations while moving in the crowd. These datasets allow us to quantitatively and systematically evaluate our method and set the stage for further studies on view birdification. Experimental results demonstrate the effectiveness of our approach for view birdification in crowds of various densities.

Our contributions are threefold: (i) the introduction of view birdification, the simultaneous recovery of the ground-plane trajectories of surrounding pedestrians and that of the observation camera just from an ego-centric view, as a novel research problem, (ii) the derivation of a cascaded optimization framework with a Bayesian formulation to solve the view birdification problem, and (iii) the construction of view birdification datasets consisting of paired real human trajectories and synthetic ego-views. We believe view birdification finds a wide range of applications, and these contributions have strong implications in computer vision and robotics as they establish view birdification as a foundation for downstream visual understanding applications including crowd behavior analysis [1, 2, 12], self-guidance [19, 20], and robot navigation [32, 39].

To our knowledge, our work is the first to formulate and tackle view birdification, which is relevant to several fundamental computer vision and robotics problems.

Bird's Eye View Transformation. Conceptually, view birdification may appear similar to bird's eye view (BEV) synthesis. The two are fundamentally different in three critical ways.
First, view birdification concerns the movements, not the appearance, in contrast to BEV synthesis [34, 41, 42, 51, 52] or cross-view association [3, 4, 38]. Second, unlike most BEV methods [24, 31, 40], view birdification cannot rely on ground-plane keypoints, multi-view images, or paired images between the views, as they are usually not available in crowded scenes. Also note that, in crowded scenes, the ground plane and footpoints cannot be clearly extracted, which makes simple homography-based approaches infeasible. Third, view birdification aims to localize all agents in a single coordinate frame across time, unlike BEV, which is relative to the observer's location at each time instant [6, 28, 50]. As such, BEV synthesis methods are not directly applicable to view birdification.

Dynamic SLAM. View birdification can be considered a dynamic SLAM problem in which everything, not just the observer but also the scene itself, is dynamic. Typical approaches to dynamic SLAM explicitly track and filter dynamic objects [8, 49] or implicitly minimize outliers caused by the dynamic objects [13, 14, 26]. In contrast to these approaches that sift out static keypoints from dynamic ones, methods that leverage both static and dynamic keypoints by, for instance, constructing a Bayesian factor graph [16, 17, 23] have also been introduced. The success of most of these approaches, however, depends on static keypoints, which are hard to find and track in cluttered dynamic scenes such as a dense crowd. View birdification requires no static keypoints and reconstructs both the ego-motion and the surrounding dynamics only from the observed motions in the ego-view.

Crowd Modeling. Modeling human behavior in crowds is essential for a wide range of applications including crowd simulation [21], trajectory forecasting [1, 12, 18], and robot navigation [2, 32, 39]. Popular approaches include multi-agent interactions based on social force models [2, 15, 29], reciprocal force models [43], and imitation learning [39]. Recently, data-driven approaches have achieved significant performance gains on public crowd datasets [1, 12, 18]. All these approaches, however, are only applicable to near top-down views. Forecasting future locations of people from first-person viewpoints has also been explored [27, 48], but these methods are limited to localization in the image plane. View birdification may provide a useful foundation for these crowd modeling tasks.

A typical scenario for view birdification is a person with a body-worn camera immersed in a crowd of people heading towards their destinations while implicitly interacting with each other. Our goal is to deduce the global movements of the people from the local observations in the ego-centric video captured by that single person. As a general setup, we assume that K people are walking on a fixed ground plane and an observation camera is mounted on one of them. We set the z-axis of the world coordinate system to the normal of the ground plane and denote the on-ground location of the k-th pedestrian as x_k = [x_k, y_k]^T. The 0-th person in the crowd, located at x_0, is the observer capturing the ego-centric video of the pedestrians k ∈ {1, 2, ..., K} who are visible to the observer. The observation camera is located at [x_0, y_0, h_0]^T, where the mounting height h_0 is constant across frames. We assume that the viewing direction is parallel to the ground plane, e.g., the person has a camera mounted on the shoulder. The same assumption applies when the observer is a vehicle or a mobile robot.
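To make this setup concrete, the following minimal sketch spells out the quantities involved. The container names and the mounting-height value are illustrative choices of ours, not part of the paper.

    # Minimal data containers for the view-birdification setup (names are ours, not the paper's).
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class PedestrianState:
        """Per-frame observation of pedestrian k on the image plane: s_k^t = (u_k, v_k, l_k)."""
        u: float  # horizontal image coordinate of the bounding-box center
        v: float  # vertical image coordinate of the bounding-box center
        l: float  # projected pedestrian height in pixels

    @dataclass
    class CameraPose:
        """Observer pose on the ground plane at one timestep: rotation R_z(theta) in SO(2), position x_0 in R^2."""
        theta: float          # viewing direction about the ground-plane normal (z-axis)
        x0: np.ndarray        # 2D position [x_0, y_0] on the ground plane
        height: float = 1.5   # mounting height h_0 (constant across frames); value here is illustrative

        def rotation(self) -> np.ndarray:
            c, s = np.cos(self.theta), np.sin(self.theta)
            return np.array([[c, -s], [s, c]])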
At each timestep t, the pedestrians are observed by a camera with pose [R | x_0]^t, where we assume 2D rotation and translation on the ground plane, i.e., R ∈ SO(2) and x_0 ∈ R^2, respectively. We assume that bounding boxes of the people captured in the ego-video are already extracted. For this, we can use an off-the-shelf multi-object tracker [46, 47], which provides the state of each pedestrian on the image plane, s_k^t = [u_k^t, v_k^t, l_k^t]^T, consisting of the projected center location p_k^t = [u_k^t, v_k^t]^T and the projected height l_k^t. Note that our method is agnostic to the actual tracking algorithm. Pedestrian IDs k ∈ {1, 2, ..., K} can also be assigned by the tracker. Given a sequence of pedestrian states from the first visible frame τ_1 to the last visible frame τ_2, S_k^{τ_1:τ_2} = {s_k^{τ_1}, s_k^{τ_1+1}, ..., s_k^{τ_2}}, our goal is to simultaneously reconstruct the K trajectories of the surrounding pedestrians X_k^{τ_1:τ_2} = {x_k^{τ_1}, x_k^{τ_1+1}, ..., x_k^{τ_2}} and that of the observation camera X_0^{τ_1:τ_2} = {x_0^{τ_1}, x_0^{τ_1+1}, ..., x_0^{τ_2}}, together with its viewing direction R^{τ_1:τ_2} = {R^{τ_1}, R^{τ_1+1}, ..., R^{τ_2}}, on the ground plane.

In the following, we set the z-axis of the world coordinate system to the normal of the ground plane (the x-y plane). Let us denote the rotation angles about the x-, y-, and z-axes by θ_x, θ_y, and θ_z, respectively. Assuming that the viewing direction of the camera is stabilized and parallel to the ground plane, we can approximate the rotation angles about the x- and y-axes as Δθ_x = 0 and Δθ_y = 0 across frames. That is, the camera pose to be estimated is represented by its rotation R_z(Δθ_z) ∈ SO(2) and translation Δx_0 ∈ R^2 on the ground plane. We assume a regular perspective ego-centric view, but the following derivation also applies to other projection models, including generic quasi-central models for fish-eye lenses [9]. In the case of perspective projection with focal length f and intrinsic matrix A ∈ R^{3×3}, the distance of a pedestrian from the observer is proportional to the ratio of the pedestrian height h_k to its projection l_k, i.e., h_k / l_k. Given the footpoint of the pedestrian in the image plane, s_k = [u_k, 0, l_k]^T, the on-ground location of the pedestrian relative to the camera, z_k = [x̃_k, ỹ_k, 0]^T, can be computed by inverse projection of the observed image coordinates (Eq. (1)), where the intrinsics A and the focal length f are known since the observation camera can be calibrated a priori. The relative coordinates z_k are thus scaled by the unknown pedestrian height h_k. The absolute position of the pedestrian x_k = [x_k, y_k]^T can then be computed from the relative coordinates z_k = [x̃_k, ỹ_k]^T, the camera position x_0 = [x_0, y_0]^T, and the viewing direction θ_z.

Given a sequence of states S_k^{τ_1:τ_2} = {s_k^{τ_1}, ..., s_k^{τ_2}}, we obtain the corresponding on-ground location estimates relative to the camera, Z_k^{τ_1:τ_2} = {z_k^{τ_1}, ..., z_k^{τ_2}}, by inverse projection with unknown scale parameters using Eq. (1). The trajectories of pedestrians on the ground plane X_k^{τ_1:τ_2} can thus be decomposed into the camera motion X_0^{τ_1:τ_2}, R^{τ_1:τ_2} and the relative positions Z_k^{τ_1:τ_2} of the pedestrians centered around the camera position.
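The inverse projection and the rigid transform to world coordinates described above can be sketched as follows. This is a minimal pinhole-model illustration under an assumed nominal height, not the authors' implementation; the function names and numerical values are ours.

    import numpy as np

    def backproject_to_ground(u, l, fx, cx, h_k):
        """Relative ground-plane position of a pedestrian from its image column u and
        projected height l (pixels), assuming a pinhole camera with focal length fx,
        principal point cx, and a (here assumed) metric height h_k.
        Returns z_k = [lateral offset, depth] in the camera's ground-plane frame."""
        depth = fx * h_k / l            # distance is proportional to h_k / l_k
        lateral = (u - cx) * depth / fx
        return np.array([lateral, depth])

    def to_world(z_rel, theta, x0):
        """Absolute ground-plane position: rotate the relative coordinates by the viewing
        direction theta and translate by the camera position x0."""
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        return R @ z_rel + np.asarray(x0)

    # Example: a pedestrian of assumed height 1.70 m whose bounding box is 100 px tall,
    # seen through a camera with fx = 1000 px, is roughly 17 m away.
    z = backproject_to_ground(u=640, l=100, fx=1000.0, cx=640.0, h_k=1.70)
    x_world = to_world(z, theta=0.1, x0=[2.0, 5.0])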
Our goal is to recover the camera ego-motion X_0^{τ_1:τ_2}, R^{τ_1:τ_2} and the pedestrian trajectories X_k^{τ_1:τ_2}, k = 1, ..., K, on the ground plane.

We derive a cascaded optimization approach to the geometric view birdification problem based on a Bayesian perspective. When a frame at time t is pre-processed into a set of states S_{1:K}^t = {s_1^t, s_2^t, ..., s_K^t}, we obtain a set of on-ground position estimates relative to the camera, Z_{1:K}^t = {z_1^t, z_2^t, ..., z_K^t} ∈ R^{2×K}, corresponding to the states S_{1:K}^t. Assume that we have sequentially estimated the on-ground positions up to time t−1, X_{0:K}^{t−τ:t−1}. Let Δx_0^t = [Δx_0^t, Δy_0^t, Δθ^t]^T ∈ R^3 be the camera ego-motion from timestep t−1 to t, consisting of a 2D translation [Δx_0, Δy_0]^T and a change in viewing direction Δθ on the ground plane. The optimal motion of the camera Δx_0^t and those of the pedestrians X_{1:K}^t can be estimated as those that maximize the posterior distribution (Eq. (3)). The motions of the observed pedestrians X_{1:K}^{t−1:t} are strictly constrained by the observing camera position x_0^t and its viewing direction θ^t. With recovered pedestrian parameters X_{1:K}^t, the optimal estimate of the camera ego-motion Δx_0^t maximizes the corresponding posterior (Eq. (4)), in which p(x_0^t | X_0^{t−τ:t−1}) and p(x_k^t | X_k^{t−τ:t−1}, Δx_0^t) are the motion priors of the camera and of the pedestrians conditioned on the camera motion, respectively. If the observer camera is mounted on a pedestrian following the crowd flow, p(x_0^t | X_0^{t−τ:t−1}) obeys the same motion model as p(x_k^t | X_k^{t−τ:t−1}). As in previous work on pedestrian detection [25], we assume that the heights of pedestrians h_k follow a Gaussian distribution. This lets us define the likelihood of the observed pedestrian position z_k^t relative to the camera x_0^t (Eq. (5)), where N(μ_h, σ_h²) is a Gaussian distribution with mean μ_h and variance σ_h². Once the ego-motion of the observing camera is estimated as Δx_0^t, the pedestrian positions X_{1:K}^t are those that maximize the posterior p(X_{1:K}^t | Z_{1:K}^t, x_0^t) (Eq. (6)). That is, we can estimate the ego-motion of the observer constrained by the perceived pedestrian movements, which conform to the crowd motion prior and the observation model. Once the camera ego-motion is estimated, we can update the individual locations of the pedestrians given the ego-motion in an iterative refinement process. View birdification can thus be solved with a cascaded optimization which first estimates the camera ego-motion and then recovers the relative locations between the camera and the pedestrians given the ego-motion estimate, while taking into account the local interactions between pedestrians.

Minimization of the corresponding negative log probabilities, Eqs. (4) and (6), can be expressed as a constrained minimization (Eq. (7)) subject to X_{1:K}^t being the minimizer of the pedestrian objective (Eq. (8)), where we define the energy functions E_c for the camera position and E_p for the pedestrian positions. We minimize the energy in Eq. (7) by first computing the optimal camera position x_0^t with gradient descent, initialized at x_0^t = x_0^{t−1}. Given the estimate of the observer location, we then estimate the pedestrian locations by solving the combinatorial optimization problem in Eq. (8) for x_k^t, considering all possible combinations of {x_1^t, ..., x_K^t} that satisfy the projection constraint of Eq. (1) and the assumed pedestrian interaction model. This can be interpreted as a fully connected graph consisting of K pedestrian nodes with unary potentials and interaction edges with pairwise potentials. Similar to prior works on low-level vision problems [5, 22], Eq. (10) can be optimized by iterative message passing [11] on the graph. The possible states x_i are uniformly sampled on the projection line around μ_h within the interval [μ_h − δS/2, μ_h + δS/2], where S is the number of samples and δ = 0.01. Considering only pairwise interactions and Gaussian potentials, the complexity of the optimization is O(KS²T), where T is the number of iterations required for convergence.
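The two-stage procedure can be illustrated with the following much-simplified sketch. It assumes a constant-position motion prior, uses a SciPy optimizer in place of plain gradient descent, and replaces the full pairwise message passing with a per-pedestrian search over candidate scales along the projection ray; all names and values are ours, not the authors' implementation.

    import numpy as np
    from scipy.optimize import minimize

    def camera_energy(dpose, Z_rel, X_prev, prev_pose, sigma=0.05):
        """Simplified E_c: how well a candidate ego-motion dpose = [dx, dy, dtheta]
        explains the relative observations Z_rel against the previous world positions
        X_prev (constant-position prior); sigma stands in for the positional
        uncertainty induced by the unknown pedestrian heights."""
        theta = prev_pose["theta"] + dpose[2]
        x0 = prev_pose["x0"] + dpose[:2]
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        e = 0.0
        for z_k, x_prev_k in zip(Z_rel, X_prev):
            x_obs = R @ z_k + x0                 # where the observation places pedestrian k
            e += np.sum((x_obs - x_prev_k) ** 2) / (2 * sigma ** 2)
        e += np.sum(dpose[:2] ** 2)              # simple smoothness prior on the camera motion
        return e

    def birdify_step(Z_rel, X_prev, prev_pose):
        """One cascaded step: (1) ego-motion by gradient-based minimization of E_c,
        (2) pedestrian refinement over candidate positions on the projection ray."""
        res = minimize(camera_energy, x0=np.zeros(3), args=(Z_rel, X_prev, prev_pose))
        dpose = res.x
        theta = prev_pose["theta"] + dpose[2]
        x0 = prev_pose["x0"] + dpose[:2]
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        # Stage 2 (sketch): scale each relative position by candidate height factors
        # around the nominal height and keep the candidate closest to the prediction;
        # the pairwise interaction term is omitted here for brevity.
        scales = np.linspace(0.9, 1.1, 21)
        X_new = []
        for z_k, x_prev_k in zip(Z_rel, X_prev):
            candidates = [R @ (a * z_k) + x0 for a in scales]
            X_new.append(min(candidates, key=lambda x: np.sum((x - x_prev_k) ** 2)))
        return {"theta": theta, "x0": x0}, X_new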
In this paper, we use two types of analytical interaction models, ConstVel [36] and Social Force [15]. We provide detailed derivations of the corresponding energy functions in the supplementary material.

We validate the effectiveness of the proposed geometric view birdification method through an extensive set of experiments. Unfortunately, the COVID-19 pandemic has made real data collection impossible, as it would inevitably involve many people. Instead, we fully leverage existing real pedestrian trajectories combined with synthetic camera views to thoroughly evaluate the accuracy of our method. Since our method only requires bounding boxes of people in the ego-centric view, we can fully evaluate its effectiveness in real scenes by using real trajectories. To the best of our knowledge, no public dataset is available for evaluating view birdification (i.e., ego-video in crowds). We construct the following three datasets, which we will publicly disseminate, for evaluating our method and also to serve as a platform for further studies on view birdification. Please see the supplementary material for their detailed statistics.

Synthetic Pedestrian Trajectories. The first dataset consists of synthetic trajectories paired with their synthetic projections to an observation camera. This data allows us to evaluate the effectiveness of view birdification when the crowd interaction model is known. The trajectories are generated by the social force model [15] with a varying number of pedestrians K ∈ {10, 20, 30, 40, 50}, and a perspective observation camera is mounted on one of them. To evaluate the validity of our geometric formulation and optimization with this dataset, we assume ideal observations of the pedestrians, i.e., pedestrians do not occlude each other and their projected heights can be accurately deduced from the observed images. We also assume that the pedestrians are extracted from the ego-centric video perfectly, but their heights h_k are sampled from a Gaussian distribution h_k ~ N(μ_h, σ_h²) with mean μ_h = 1.70 m and a standard deviation σ_h ∈ [0.00, 0.07] m, based on the statistics of European adults [44].

Real Pedestrian Trajectories. The second dataset consists of real pedestrian trajectories paired with their synthetic projections to an observation camera. The trajectories are extracted from publicly available crowd datasets: three sets of sequences from ETH [33] and UCY [21]. As in the synthetic pedestrian trajectories dataset, we render the corresponding ego-centric videos from a randomly selected pedestrian's vantage point. With this, we obtain test sequences which we refer to as Univ and Hotel from ETH, and Students from UCY. The Hotel, Univ, and Students datasets correspond to sparsely, moderately, and densely crowded scenarios, respectively. This dataset allows us to evaluate the effectiveness of our method on real data (movements).
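To illustrate how such paired data can be produced, the following sketch projects ground-truth ground-plane positions into a virtual observation camera under a pinhole model, with heights drawn from the Gaussian described above. It is not the authors' rendering pipeline (which produces actual video frames); the function name and the camera parameters are illustrative assumptions.

    import numpy as np

    def synthesize_observations(X_world, cam_x0, cam_theta, fx=1000.0, cx=640.0,
                                mu_h=1.70, sigma_h=0.05, rng=None):
        """Project ground-truth ground-plane positions X_world (K x 2) into a virtual
        perspective camera at (cam_x0, cam_theta) to produce the image-plane states
        (u_k, l_k) used as input to birdification. Heights are drawn from
        N(mu_h, sigma_h^2); fx and cx are illustrative intrinsics."""
        rng = np.random.default_rng() if rng is None else rng
        c, s = np.cos(cam_theta), np.sin(cam_theta)
        R_inv = np.array([[c, s], [-s, c]])      # world -> camera ground-plane frame
        states = []
        for x_k in X_world:
            lateral, depth = R_inv @ (np.asarray(x_k) - np.asarray(cam_x0))
            if depth <= 0:                       # behind the observer: not visible
                states.append(None)
                continue
            h_k = rng.normal(mu_h, sigma_h)      # true (unknown to the method) height
            u = fx * lateral / depth + cx        # image column of the pedestrian
            l = fx * h_k / depth                 # projected height in pixels
            states.append((u, l))
        return states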
Photorealistic Crowd Dataset. The last dataset consists of synthetic trajectories paired with their photorealistic projections, captured with a limited field of view and frequent occlusions between pedestrians. Evaluation on this dataset lets us examine the end-to-end effectiveness of our method, including robustness to tracking errors. Inspired by previous works on crowd analysis and trajectory prediction [10, 45], we use the video game engine of Grand Theft Auto V (GTAV), developed by Rockstar North [35], with crowd flows automatically generated from programmed destinations with collision avoidance. We collected pairs of ego-centric videos with a 90° field of view and corresponding ground-truth trajectories on the ground plane using the Script Hook V API [37]. We randomly picked 50 different person models with different skin colors, body shapes, and clothes. We prepare two versions of this data, one with manually annotated centerlines and heights of the pedestrians in the observed video frames, and the other with those automatically extracted by a pedestrian detector [46] pretrained on MOT-16 [30], which includes data captured from a moving platform.

Evaluation Metric. We quantify the accuracy of our method by measuring the differences of the estimated positions of the pedestrians x_k^t and of the observer R^t, x_0^t on the ground plane from their ground-truth values ẋ_k^t, Ṙ^t, and ẋ_0^t, respectively. The translation error of the observer is Δt = (1/T) Σ_t ||x_0^t − ẋ_0^t||, where T is the number of timesteps in the sequence. The rotation error of the observer is Δr = (1/T) Σ_t arccos(½ trace(R^t (Ṙ^t)^{-1})). We also evaluate the absolute and relative localization errors of the surrounding pedestrians, Δx and Δx̃, defined analogously as the mean distances between the estimated and ground-truth pedestrian positions in the world coordinate frame and in the camera-centered frame, respectively.
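The following is our reading of these metrics as a short, self-contained sketch; variable names are ours. Note that, for rotation matrices, the transpose equals the inverse, so the rotation error below matches the definition above.

    import numpy as np

    def camera_errors(R_est, R_gt, t_est, t_gt):
        """Mean rotation and translation errors of the observer over T frames.
        R_* are lists of 2x2 rotation matrices, t_* are lists of 2D positions."""
        rot = [np.arccos(np.clip(0.5 * np.trace(Re @ Rg.T), -1.0, 1.0))
               for Re, Rg in zip(R_est, R_gt)]
        trans = [np.linalg.norm(te - tg) for te, tg in zip(t_est, t_gt)]
        return float(np.mean(rot)), float(np.mean(trans))

    def pedestrian_errors(X_est, X_gt):
        """Mean localization error of surrounding pedestrians; X_* are (T, K, 2) arrays.
        Applying the same function to camera-centered coordinates gives the relative error."""
        return float(np.mean(np.linalg.norm(np.asarray(X_est) - np.asarray(X_gt), axis=-1)))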
Figure 2: Results on synthetic pedestrian trajectories. The markers denote the errors of the estimated camera rotation Δr and translation Δt, and the relative Δx̃ and absolute Δx localization errors, for pedestrian-height standard deviations σ_h = 0.01, 0.05, 0.07 m.

Fig. 2 shows the view birdification results on the synthetic trajectories dataset. Although both the rotation and translation errors slightly increase as the height standard deviation σ_h becomes larger, the errors decrease as the number of people K increases. This suggests that the more crowded the scene, the more certain the camera position and thus the more accurate the birdification of the surrounding pedestrians.

Results on unknown real interaction models. The real trajectories data allow us to evaluate the accuracy of our method when the interactions between pedestrians are not known. We employ two pedestrian interaction models, Social Force (SF) [15] and ConstVel (CV) [36]. We first evaluate the accuracy of our view birdification (VB) using these models, referred to as VB-SF and VB-CV, and compare them with baseline prediction models. In these baseline models, referred to as ConstVel (CV) and Social Force (SF), we extrapolate a pedestrian position x_k^t from its past locations X_k^{t−2:t−1} based on the corresponding interaction model without using the observer's ego-centric view. That is, the baselines do not perform view birdification but extrapolate according to pre-defined motion models on the ground plane.

Table 1: Birdification results on real trajectories. The relative and absolute localization errors of pedestrians, Δx̃ and Δx (top), and the camera ego-motion errors, Δr and Δt (bottom), computed for each frame for three different video sequences (Hotel / sparse, Univ / mid, Students / dense). Baseline methods only extrapolate movements on the ground plane, resulting in missing entries (-). The results demonstrate the effectiveness of our view birdification.

Table 1 shows the errors of our method and the baseline models. These results clearly show that our method, both VB-CV and VB-SF, estimates the camera ego-motion and localizes the surrounding people more accurately, which demonstrates the effectiveness of birdifying the view and exploiting the geometric constraints on the pedestrians through it. VB-SF performs better than VB-CV especially in scenes with rich interactions such as Univ and Students, while the two show similar performance on the Hotel dataset, which includes fewer interactions. Both VB-SF and VB-CV show accurate camera ego-motion results on the Students dataset, which demonstrates the robustness of ego-centric view localization regardless of the assumed pedestrian interaction model. Our method achieves high accuracy on all three datasets across different standard deviations of heights σ_h ∈ [0.00, 0.07]. This also shows that the method is robust to variation in human heights.

Photorealistic Crowds. Fig. 3 shows qualitative results on the photorealistic crowd dataset. As shown in the top two rows, our method accurately estimates the camera ego-motion and the on-ground positions of pedestrians automatically detected with an off-the-shelf tracker [46]. People tracked in more than three frames are birdified. Even with occlusions in the image and noisy height estimates computed from detected bounding boxes, our approach robustly estimates the camera ego-motion and the surrounding pedestrian positions. Due to perspective projection, the localization error caused by erroneous detections in the image plane is proportional to the ground-plane distance between the camera and the detected pedestrian. We further compared these results with manually annotated pedestrian heights, as shown in the bottom two rows of Fig. 3, to highlight the effect of automatically detecting the pedestrians for view birdification (i.e., to see how the results change if the pedestrian heights were accurate). The resulting accuracies are comparable, which demonstrates the end-to-end effectiveness. To further ameliorate the errors caused by detection noise, our method can also be extended, for instance, by replacing the noise model in Eq. (5) with a 2D Gaussian distribution. Please also see the supplemental material and video.

Figure 3: Qualitative results on the photorealistic crowd dataset (input frames at t+20, t+40, t+60, and t+80).

In this paper, we introduced view birdification, the problem of recovering the movements of surrounding people on the ground plane from a single ego-centric video captured in a dynamic, cluttered scene. We formulated view birdification as a geometric reconstruction problem and derived a cascaded optimization approach that consists of camera ego-motion estimation and pedestrian localization while fully modeling the local pedestrian interactions. Our extensive evaluation demonstrates the effectiveness of the proposed view birdification method for crowds of varying densities. Currently, occlusion handling is carried out by an external multi-object tracker. We envision a feedback loop from our birdification framework that informs the multi-object tracker to reason better about occluded targets, which would likely enhance the accuracy as a whole even in heavily occluded scenes. We believe our work has implications for both computer vision and robotics, including crowd behavior analysis, self-localization, and situational awareness, and opens new avenues of applications including dynamic surveillance.
In Section 4, we formulated view birdification as an iterative energy minimization problem that consists of a pedestrian interaction model p(x_k^t | X_k^{t−τ:t−1}) and a likelihood p(z_k^t | x_k^t, Δx_0^t) defined by the geometric observation model with ambiguities arising from human height estimates (Eq. (5)). Our framework is not limited to a specific pedestrian interaction model, and any model that explains pedestrian interactions in a crowd can be incorporated. In the following, we consider two example models with a temporal window of τ = 2.

Constant Velocity. ConstVel [36] is a simple yet effective model of pedestrian motion in a crowd which linearly extrapolates future trajectories from the last two frames, i.e., x_k^t = x_k^{t−1} + (x_k^{t−1} − x_k^{t−2}). The model is independent of the other pedestrians, and the overall pedestrian interaction model can be factorized as p(X_{1:K}^t | X^{t−2:t−1}) = Π_k p(x_k^t | X_k^{t−2:t−1}). The energy E_p then reduces to a sum of per-pedestrian unary terms penalizing deviations from this linear extrapolation.

Social Force. The Social Force Model [15] is a well-known physics-based model that simulates multi-agent interactions with reciprocal forces and is widely used in crowd analysis and prediction studies [29, 43]. Each pedestrian k with mass m_k follows m_k d²x_k/dt² = F_k, where F_k is the force acting on x_k, consisting of the personal desired force F_p and the reciprocal force F_r. The personal desired force is proportional to the discrepancy between the current velocity and the desired velocity w_k, which can be empirically approximated as the average velocity of the neighboring pedestrians i ∈ N(x_k) [29]. The form of the reciprocal force F_r is determined by the set of interactions between pedestrian nodes x_i ∈ X_C. To reduce the complexity of the optimization, we approximate the multi-human interaction F_r(X_C) with a collection of pairwise interactions F_r(x_i, x_k), and we assume a standard Gaussian potential for the reciprocal force between two pedestrians. Without loss of generality, we set m_k = 1, assuming that the masses of pedestrians in a crowd are almost the same. Taking the last two frames as inputs, the complete pedestrian interaction model becomes p(X_{1:K}^t | X^{t−2:t−1}). Taking negative log probabilities, the overall energy model in Eq. (16) decomposes into unary terms, which penalize deviations from the motion predicted by the desired force, and pairwise terms given by the Gaussian reciprocal potential.

We use the validation split of each crowd dataset [18] to find the optimal hyperparameters of the pedestrian interaction models. We set the weight parameter of the desired force F_p to η = 0.5, and the variance of the Gaussian potential to σ² = 1.0 for the social force model.
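As a minimal sketch of these two interaction models, the following implements the ConstVel unary term, the Gaussian pairwise potential, and the desired-force term under the stated hyperparameters. This is our simplified reading, not the authors' energy functions; the function names and the exact functional forms of the omitted equations are assumptions.

    import numpy as np

    def constvel_energy(x_t, x_prev, x_prev2):
        """ConstVel unary term: penalize deviation from linear extrapolation of the
        last two positions, x_pred = x_{t-1} + (x_{t-1} - x_{t-2})."""
        x_pred = 2.0 * np.asarray(x_prev) - np.asarray(x_prev2)
        return float(np.sum((np.asarray(x_t) - x_pred) ** 2))

    def social_force_pairwise(x_i, x_k, sigma2=1.0):
        """Gaussian reciprocal potential between two pedestrians: a repulsive interaction
        that decays with squared distance (sigma2 as in the supplementary hyperparameters)."""
        d2 = float(np.sum((np.asarray(x_i) - np.asarray(x_k)) ** 2))
        return float(np.exp(-d2 / (2.0 * sigma2)))

    def desired_force(v_k, neighbor_velocities, eta=0.5):
        """Personal desired force F_p: proportional to the gap between the desired
        velocity w_k (approximated by the average velocity of neighbors) and the
        current velocity v_k; eta is the weight from the supplementary material."""
        w_k = np.mean(np.asarray(neighbor_velocities), axis=0)
        return eta * (w_k - np.asarray(v_k))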
Fig. 4 visualizes typical example sequences from the synthetic dataset, referred to as Sim, and from the real trajectory datasets, referred to as Hotel, Univ, and Students. In all of these datasets, a virtual observation camera is assigned to one of the trajectories and the observer captures the rest of the pedestrians in the sequence. Fig. 5 shows example trajectories of the GTAV dataset. The ground field, on which pedestrians walk from their starting points to their destinations, is 20 m × 40 m. We spawned 50 pedestrians starting from one of the four corners of the field, [−10, −10], [10, 10], [10, −20], [10, 20], and set the opposite side of the field as their destinations. Both the starting points and the destinations were randomized with a uniform distribution. In the GTAV dataset, an observation camera is mounted on one of the pedestrians walking in the crowd flow, and we obtain pairs of ground-truth trajectories and ego-centric videos with a 90° field of view via the Script Hook V API [37].

Figure 4: Typical example trajectories from the datasets Sim, Hotel, Univ, and Students. In the Sim example, the red triangle is the virtual camera that observes the projected pedestrians on the image plane, where the dashed gray lines denote the projection.

In this paper, we constructed several datasets consisting of synthetic pedestrian trajectories (Sim), real pedestrian trajectories (Hotel, Univ, Students), and a photorealistic crowd simulation (GTAV). These datasets are designed to differ in several aspects (i.e., crowd density, synthetic view or not, synthetic or real interaction models) for evaluation studies of our proposed view birdification method. Table 2 summarizes the statistics and taxonomy of these datasets.

Table 2: Overview of the birdification datasets. For real trajectories, we selected the Hotel, Univ, and Students scenes by taking into account the number of people in the crowd. "Seq." corresponds to all the frames captured by a moving observer. "Len." denotes the number of frames included in one sequence.

In Section 5.2 of the main text, we omitted quantitative results on the GTAV dataset due to space limitations. Table 3 shows quantitative results on the GTAV dataset with the metrics introduced in Sec. 5.2 of the manuscript. As introduced in the paper, we prepared two versions of inputs, one manually annotated with the centerlines of the people and their heights, and the other with those automatically extracted by a multi-object tracker (MOT) [46]. We compare the view birdification results using these two different inputs, referred to as Birdify-CLine and Birdify-MOT. The results show that Birdify-CLine and Birdify-MOT achieve comparable performance in terms of the rotation and translation errors, Δr and Δt, since the localization of the observer is insensitive to pedestrian detection errors. On the other hand, in terms of the pedestrian localization errors, Δx̃ and Δx, Birdify-MOT shows inferior performance to the manually annotated inputs. This is mainly because we currently estimate the initial position of a pedestrian x_k^0 relative to the observer position x_0^t by Eq. (1) in the main text whenever a new pedestrian appears in a frame. The accuracy of this initial estimate can be improved by fine-tuning the multi-object tracker or by using the pose of the person [7, 47]. We will explore these in future work.

Table 3: Birdification results on the GTAV dataset. The relative and absolute localization errors of pedestrians, Δx̃ and Δx, respectively, and the errors of the camera ego-motion estimation, Δr and Δt, computed for each frame; their mean values are shown.

We also analyze failure cases of our view birdification to understand the limitations of the method. For this, we picked sequences from the Univ data that showed a high error rate in terms of camera localization. Fig. 6 visualizes the posterior distribution of the observer location p(x_0^t | Z_{1:K}^t, X_{0:K}^{t−1}) and that of the surrounding pedestrians, ∫ p(X_{1:K}^t | Z_{1:K}^t, x_0^t) p(x_0^t) dx_0^t, both evaluated by sampling x_0^t ∈ X_s in Eq. (4) and Eq. (5) of the manuscript, respectively. The first and third rows depict the ground-truth trajectories of the camera and the pedestrians from t to t + 9.
The number of pedestrians changes from K = 3 to K = 5. The second and fourth rows visualize the posterior distributions corresponding to each of those two rows. As can be observed in the posteriors shown in the second row, the estimated observer location becomes a heavy-tailed distribution when the number of pedestrians in the crowd is small (K = 3). In contrast, as shown in the fourth row, the posterior distribution becomes sharper when the crowd is denser (K = 5). The ambiguity of localization increases when pedestrians walk almost parallel to the observer (e.g., timesteps t + 2 and t + 3). In contrast, the posterior distribution becomes sharp again when the camera observes more pedestrians walking in diverse directions. Moreover, when the camera observes a large number of pedestrians whose motion conforms to a known crowd motion model, the camera ego-motion estimate depends mainly on the observed crowd movements and is less sensitive to the assumed ego-motion model, regardless of whether the camera motion is consistent with the dominant crowd flow. That is, as long as the camera observes a sufficient number of pedestrians walking in diverse directions, our method can successfully birdify its views.
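Such posterior maps can be produced with a simple grid evaluation, sketched below under simplifying assumptions of ours (fixed viewing direction, isotropic Gaussian likelihood); this is our reading of the visualization procedure, not the authors' code.

    import numpy as np

    def camera_posterior_grid(Z_rel, X_pred, theta, grid_x, grid_y, sigma=0.5):
        """Unnormalized posterior of the observer position over a grid of candidates,
        scoring how well each candidate explains the relative observations Z_rel
        against the motion-model predictions X_pred (both lists of 2D points)."""
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        post = np.zeros((len(grid_y), len(grid_x)))
        for iy, y0 in enumerate(grid_y):
            for ix, x0 in enumerate(grid_x):
                cand = np.array([x0, y0])
                ll = 0.0
                for z_k, x_pred_k in zip(Z_rel, X_pred):
                    ll += -np.sum((R @ z_k + cand - x_pred_k) ** 2) / (2 * sigma ** 2)
                post[iy, ix] = ll
        return np.exp(post - post.max())   # subtract the max for numerical stability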
References

Social LSTM: Human trajectory prediction in crowded spaces
Modelling social interaction between humans and service robots in large public spaces
Ego2Top: Matching viewers in egocentric and top-view videos
EgoTransfer: Transferring motion across egocentric and exocentric domains using deep neural networks
Mixture of trees probabilistic graphical model for video segmentation
MonoLoco: Monocular 3D pedestrian localization and uncertainty estimation
MonoLoco: Monocular 3D pedestrian localization and uncertainty estimation
DynaSLAM: Tracking, mapping and inpainting in dynamic scenes
Calibration of axial fisheye cameras through generic virtual central models
Long-term human motion prediction with scene context
Efficient belief propagation for early vision
Social GAN: Socially acceptable trajectories with generative adversarial networks
Map building with mobile robots in populated environments
Map building with mobile robots in dynamic environments
Social force model for pedestrian dynamics
Dynamic SLAM: The need for speed
ClusterVO: Clustering moving instances and estimating visual odometry for self and surroundings
The Trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs
BBeep: A sonic collision avoidance system for blind travellers and nearby pedestrians
BlindPilot: A robotic local navigation system that leads blind people to a landmark object
Crowds by example. Computer Graphics Forum
Track to the future: Spatio-temporal video segmentation with long-range motion cues
Stereo vision-based semantic 3D object and ego-motion tracking for autonomous driving
A vision based top-view transformation model for a vehicle parking assistant
Where, what, whether: Multi-modal learning meets pedestrian detection
Taking a deeper look at the inverse compositional algorithm
Multimodal future localization and emergence prediction for objects in egocentric view with a reachability prior
MonoLayout: Amodal scene layout from a single image
Abnormal crowd behavior detection using social force model
MOT16: A benchmark for multi-object tracking
General dynamic scene reconstruction from multiple view video
L2B: Learning to balance the safety-efficiency trade-off in interactive crowd-aware robot navigation
You'll never walk alone: Modeling social behavior for multi-target tracking
Cross-view image synthesis using conditional GANs
Rockstar Games: Grand Theft Auto V
What the constant velocity model can teach us about pedestrian motion prediction
Action recognition in the presence of one egocentric and multiple static cameras
Socially compliant navigation through raw depth inputs with generative adversarial imitation learning
Modeling dynamic scenes recorded with freely moving cameras
Multi-channel attention selection GAN with cascaded semantic guidance for cross-view image translation
Local class-specific and global image-level generative adversarial networks for semantic-guided scene generation
Reciprocal n-body collision avoidance
Sizing up human height variation
Learning from synthetic data for crowd counting in the wild
Towards real-time multi-object tracking
Pose Flow: Efficient online pose tracking
Future person localization in first-person videos
DS-SLAM: A semantic visual SLAM towards dynamic environments
Body meshes as points
View synthesis by appearance flow
Generative adversarial frontal view to bird view synthesis