Chen, Ying; Kwon, Hojung; Inaltekin, Hazer; Gorlatova, Maria. VR Viewport Pose Model for Quantifying and Exploiting Frame Correlations. 2022-01-11.

The importance of the dynamics of the viewport pose, i.e., the location and orientation of users' points of view, for virtual reality (VR) experiences calls for the development of VR viewport pose models. In this paper, informed by our experimental measurements of viewport trajectories across 3 different types of VR interfaces, we first develop a statistical model of viewport poses in VR environments. Based on the developed model, we examine the correlations between pixels in VR frames that correspond to different viewport poses, and obtain an analytical expression for the visibility similarity (ViS) of the pixels across different VR frames. We then propose a lightweight ViS-based algorithm, ALG-ViS, that adaptively splits VR frames into background and foreground, reusing the background across different frames. Our implementation of ALG-ViS in two Oculus Quest 2 rendering systems demonstrates ALG-ViS running in real time, supporting the full VR frame rate, and outperforming baselines on measures of frame quality and bandwidth consumption.

I. INTRODUCTION

Virtual reality (VR), which immerses users in computer-generated virtual environments [1], has been showing promise in many applications including gaming, education, and healthcare [2]. VR is expected to boost global GDP by $450 billion by 2030 [3]. The high expectations for VR, coupled with its known resource-hungry nature [4], have spurred a wide range of recent research that optimizes VR systems to reduce their communication and computing resource consumption [5]-[10].

A particular feature of VR is the tight coupling between users' actions and the generated frames. In traditional visual media, the frames that are shown to the users are fixed. By contrast, in VR, to allow the users to independently explore virtual worlds, each frame is generated for the specific point of view of the user at a given time, i.e., for the specific viewport pose, namely the x, y, and z coordinates and the polar and azimuth orientation angles θ and φ of the user's VR headset or another interface to the virtual world (mobile phone [5], [8], [11]; computer monitor [12]). Hence, the correlations between different VR frames, and the performance of approaches that exploit them to reduce resource consumption in VR [8]-[10], are intimately tied to the dynamics of user behavior within the VR experience. We examine and exploit this phenomenon in this work.

First, we develop a statistical model of users' VR viewport pose, comprised of models of the pose components, orientation and position. To develop this model, we collected a dataset of VR viewport trajectories in 3 VR games and across 3 different types of VR user interfaces, with over 5.5 hours of user data in total. To characterize the correlation of viewport orientations between VR frames that are ∆t seconds apart, we obtain models of the change of azimuth and polar angles over ∆t. To characterize the displacement between VR viewport positions that are ∆t apart, we propose a modified random waypoint model (RWP) with random pause times ('paused-MRWP'). We demonstrate a close fit of the developed VR viewport pose model to the experimental data. To the best of our knowledge, this is the first statistical model of viewport pose in VR.
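To make the pose notation concrete, the following minimal Python sketch represents a single viewport pose sample and the time separation ∆t between two samples as used throughout the analysis; the field names are illustrative and are not the schema of the released dataset.

```python
from dataclasses import dataclass

@dataclass
class ViewportPose:
    """One viewport pose sample: position in meters, angles in radians."""
    t: float        # timestamp (s)
    x: float        # position in the Earth coordinates XYZ
    y: float
    z: float
    theta: float    # polar angle in [0, pi], measured from the +Y axis
    phi: float      # azimuth angle in [-pi, pi), measured in the XZ-plane

# Two poses dt seconds apart, e.g., the poses of a 'reference' and a 'novel' frame.
ref = ViewportPose(t=0.0, x=1.0, y=1.7, z=2.0, theta=1.55, phi=0.30)
nov = ViewportPose(t=0.5, x=1.2, y=1.7, z=2.4, theta=1.60, phi=0.45)
dt = nov.t - ref.t  # the time interval between the two frames
```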
Next, we apply the developed pose model to quantify the similarity of pixels across VR frames. For similar poses, the VR frames are highly redundant, as shown in Fig. 1. It is thus possible to reduce resource consumption by rendering a set of 'reference' frames and generating other, 'novel', frames by rendering only a portion of each frame while producing the rest by reusing the reference frame via view projection [8], [13]. In this paper we derive analytical expressions for the visibility similarity (ViS) of pixels across different VR frames, relating the poses of the reference and novel frames through the developed viewport pose model, and accounting for the misalignments of the fields of view (FoVs) and the VR contents-to-viewport distance differences between the novel and the reference frames. We verify our analysis via Unity 3D [14] game engine-based simulations.

Finally, we exploit the formulated ViS to adaptively divide VR frame contents into background and foreground, in order to render the foreground for the novel frames while reusing the background. Separate treatment of background and foreground in VR frame generation has been considered in multiple lines of work [9]-[11], [15], [16], which use heuristics for this separation. In this work, we propose a lightweight algorithm, ALG-ViS, that uses the analytical ViS to adaptively determine the distance threshold beyond which the contents are treated as background. We incorporate the developed ALG-ViS in two rendering systems based on Oculus Quest 2 (also known as Meta Quest 2), one on-device and one supported by edge computing. In both systems, ALG-ViS runs in real time, supporting the full VR frame rate and outperforming a set of baselines on measures of frame quality and resource consumption.

To summarize, the main contributions of this paper are: (i) the first statistical model of viewport pose in VR, (ii) the analysis of the visibility similarity between different VR frames, and (iii) an analytically grounded algorithm for determining which contents to reuse across different frames. We make the VR viewport pose dataset and our implementation code publicly available via GitHub.¹

The rest of this paper is organized as follows. We review the related work in §II, propose the viewport pose model in §III, and analyze the ViS and propose ALG-ViS in §IV. We present the evaluation in §V and conclude the paper in §VI.

II. RELATED WORK

Device pose modeling: VR frame generation requires information about the pose (position and orientation) of the user's point of view. The vast body of work that has, over the years, modeled human mobility in many different applications [17]-[19] focused on human positions but not orientations. Orientations of handheld mobile devices are starting to be modeled in the context of visible light communications [20], [21]. We are unaware of existing statistical models of users' viewport pose in VR. The position component of our developed model builds on the modified RWP proposed in [22], and one of the orientation components is related to observations previously made in [23]. The comprehensive model we propose significantly modifies and extends these approaches.

Predicting VR viewport pose: Recently, several approaches that predict the pose or its components in VR systems have been developed [11], [24]-[28]. Unfortunately, highly immersive VR experiences are known to be negatively affected by errors in pose prediction [8].
Our statistical approach can be seen as making decisions based on the distribution of viewport poses, rather than a specific predicted pose. Our evaluation demonstrates that this approach improves image quality and bandwidth variability over prediction-based approaches.

Exploiting redundancy across VR frames: Multiple methods for reducing the required bandwidth and transmission latency in VR have been developed [5]-[10]. In a rich body of work [9]-[11], [15], [16], a VR frame is classified into background, which is relatively static across VR frames, and foreground, which is less similar from one frame to the next. The background can be rendered on the edge and prefetched by the VR device, while the foreground is rendered on the mobile device [11], [15], [16]; the background can also be reused across multiple frames [9], [10]. These studies use heuristics to separate the background and the foreground. Complementing this work, we derive an analytical expression for the inter-frame pixel similarity, which we use to split the background and the foreground adaptively, via a lightweight algorithm that can run on-device or on the edge server. Our evaluation demonstrates that our approach improves the VR image quality while consuming fewer resources.

TABLE I: VR games in the collected dataset.
Game         Triangles   Vertices
VK [33]      2,400 K     1,600 K
Lite [34]    65.7 K      52.4 K
Office [35]  207.6 K     143.7 K

We introduce our collected dataset in §III-A, describe our orientation model in §III-B and our position model in §III-C, and model the correlation between them in §III-D.

To complement existing datasets of users' head orientation in 360° videos [23], [29], [30] and a small-scale single-interface dataset of users' head pose in untethered VR [31], we collected a dataset of users' viewport pose in the 3 VR games listed in Table I, across 3 different common VR user interface types. Specifically, we examined: (i) VR experienced through a VR headset and controlled through user head rotation and a VR controller ("headset VR"), (ii) "desktop VR" [12], experienced through the user's desktop monitor and controlled through the desktop's mouse and keyboard, and (iii) VR experienced through a mobile phone, controlled by moving the phone and tapping on it [32]. Our institutional review board (IRB)-approved data collection, conducted under COVID-19 restrictions, involved remote desktop VR and phone-based VR data collection via apps that we distributed to remote users, and a small number of socially distanced in-lab experiments for headset and phone-based VR. In total, we recorded experiences of 5 users with headset and phone-based VR, and 20 users with desktop VR. For desktop and phone-based VR, each user explored the 3 VR games for 2-5 minutes (per game). For headset VR, the users explored each game for 2 minutes to avoid simulator sickness. Additional data collection protocol details and the dataset are provided via GitHub.¹

A VR viewport is depicted in Fig. 2, where n is the unit vector along the optical axis of the camera. We first introduce the representation of the viewport orientation using variables related to n, followed by the definition of the statistical viewport orientation model.

Definition 1 (Viewport orientation representation). The viewport orientation is the tuple (θ, φ), where the polar angle θ ∈ [0, π] is the angle between n and the positive direction of the Y-axis, and the azimuth angle φ ∈ [−π, π) is the angle between the projection of n onto the XZ-plane and the positive direction of the X-axis in the Earth coordinates XYZ.
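As an illustration of Definition 1, the sketch below (an assumed helper, not the authors' released code) recovers (θ, φ) from a unit optical-axis vector n expressed in Earth coordinates XYZ with Y pointing up; the sign convention for the azimuth may differ depending on the engine's handedness.

```python
import numpy as np

def viewport_orientation(n):
    """Return (theta, phi) of Definition 1 for a forward unit vector n = (nx, ny, nz).

    theta: angle between n and the +Y axis, in [0, pi].
    phi:   angle between the projection of n onto the XZ-plane and the +X axis, in [-pi, pi).
    """
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    theta = np.arccos(np.clip(n[1], -1.0, 1.0))   # polar angle measured from +Y
    phi = np.arctan2(n[2], n[0])                  # azimuth in the XZ-plane
    if phi >= np.pi:                              # keep phi in [-pi, pi)
        phi -= 2.0 * np.pi
    return theta, phi

# Looking horizontally along +X gives theta = pi/2 and phi = 0.
print(viewport_orientation([1.0, 0.0, 0.0]))
```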
θ and φ characterize how users look vertically and horizontally.

Definition 2 (Viewport orientation model). The viewport orientation model is the tuple (p_θ(θ), p_∆θ(∆θ), p_∆φ(∆φ)), where p_θ(θ), p_∆θ(∆θ), and p_∆φ(∆φ) are the probability density functions (PDFs) of the polar angle θ, the polar angle change ∆θ ∈ [−π, π] over the time interval ∆t, and the azimuth angle change ∆φ ∈ [−π, π] over ∆t. In this paper we assume that p_θ(θ), p_∆θ(∆θ), and p_∆φ(∆φ) are independent.

∆θ is given by ∆θ = θ_nov − θ_ref, where θ_ref and θ_nov are the polar angles of the reference and novel frames taken ∆t seconds apart. ∆φ is calculated as ∆φ = −π + mod(φ_nov − φ_ref + π, 2π), where mod(a, b) = a − b⌊a/b⌋, ⌊·⌋ is the floor function, and φ_ref and φ_nov are the azimuth angles of the reference and novel frames taken ∆t seconds apart. We examine the azimuth angle change ∆φ rather than φ itself because φ can be assumed to be uniformly distributed when VR contents are scattered along different longitudes in the VR environment and there are no viewing preferences. Hence, the distribution of φ does not provide information about the correlation between different VR frames.

1) Distribution fit: We evaluate the distribution fit for the experimental measurements of θ, ∆θ, and ∆φ. In analyzing p_θ(θ) and p_∆θ(∆θ), we fit the experimental data to a set of common statistical distributions. In analyzing p_∆φ(∆φ), the PDFs of which have irregular shapes for some values of ∆t, we fit the experimental data to the set of common distributions and to mixtures of two different common distributions. As the error metric, we use the sum of squared errors (SSE) [36] between the data and the fitted distribution.

2) Statistical distribution of the polar angle θ: We found Laplace distributions, with means close to 90° (89.0°-92.3°) and scales ranging from 3.5 to 7.4, to best fit the experimental data (see a summary in Table II and a fit example in Fig. 3). This is intuitive: it corresponds to humans having a bias for looking straight ahead, at the central parts of VR contents, without frequently tilting their heads. Among the 3 VR games, the scale values are the largest for VK (4.3-7.4), which has more contents scattered along different latitudes than the other games. Among the 3 interface types, the scale values are the smallest for headset VR (3.5-4.3), corresponding to users looking straight ahead rather than up and down. We hypothesize that this is due to the discomfort associated with tilting the head drastically while wearing a headset.

3) Statistical models for polar angle change ∆θ: The experimental distributions of ∆θ for different ∆t values closely fit zero-mean Laplace distributions. The scales b_{l,θ} of the Laplace distributions that yield the best fit for different ∆t values in Lite are shown in Table III. As ∆t increases, the correlation between polar angles decreases, leading to an increase of the scale with ∆t (e.g., from 0.305 for ∆t = 5/60 s to 3.407 for ∆t = 100/60 s for desktop VR). Among the 3 interface types, headset VR has the largest b_{l,θ} when ∆t is small (e.g., 1.92 vs. 1.48 and 1.41 for ∆t = 30/60 s), indicating that the polar angle changes more rapidly. This is due to the ease of changing the viewport orientation over a small time interval in headset VR.

4) Statistical models for azimuth angle change ∆φ: In our examinations, for some ∆t the distributions of ∆φ appeared to have canonical shapes, while for others they appeared as a mixture of distributions. Thus we fit the experimental data to both common and mixed distributions.
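As a sketch of the fitting procedure described above (run on synthetic stand-in samples rather than the collected dataset), the snippet below fits a Laplace distribution by maximum likelihood and scores it with the SSE between the empirical histogram density and the fitted PDF; the binning choice is an assumption, not the paper's exact procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for measured polar-angle changes (degrees); replace with real samples.
dtheta = rng.laplace(loc=0.0, scale=1.5, size=5000)

# Maximum-likelihood fit of a Laplace distribution (location mu, scale b).
mu, b = stats.laplace.fit(dtheta)

# SSE between the empirical density and the fitted PDF, on shared histogram bins.
counts, edges = np.histogram(dtheta, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
sse = np.sum((counts - stats.laplace.pdf(centers, mu, b)) ** 2)
print(f"mu = {mu:.3f}, b = {b:.3f}, SSE = {sse:.4f}")
```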
We present the best distribution fits and their parameters, for a subset of ∆t values, in desktop VR for all 3 games jointly, in Table IV. For the cases of mixed distributions, logistic and Laplace in these examples, the PDF of the mixed distribution is written as

p_∆φ(∆φ) = p_l · [1/(2b_l)] e^(−|∆φ − µ_l|/b_l) + (1 − p_l) · e^(−(∆φ − µ_lo)/b_lo) / { b_lo [1 + e^(−(∆φ − µ_lo)/b_lo)]² },

where µ_l and b_l are the mean and the scale of the Laplace distribution, µ_lo and b_lo are the mean and the scale of the logistic distribution, and p_l sets the fractions of the Laplace and logistic components.

The best distribution fit for ∆φ changes with ∆t. When ∆t is small (i.e., when ∆t < β_1), ∆φ is best modeled by a Laplace distribution with a relatively small scale. When β_1 ≤ ∆t < β_2, ∆φ is best modeled by a mixture of logistic and Laplace distributions, corresponding to users' tendency to change their head orientations only slightly over these time intervals (i.e., −15° < ∆φ < 15°). Finally, when ∆t ≥ β_2, the individual angle observations become uncorrelated and are best modeled by a uniform distribution U[−180°, 180°). From the collected desktop VR pose data, we obtain β_1 = 189/60 s and β_2 = 1549/60 s. Examples of these three cases are shown in Fig. 4. For the other 2 VR interfaces, we observe similar patterns, but β_1 and β_2 are different: β_1 = 244/60 s and β_2 = 1003/60 s for headset VR, and β_1 = 496/60 s and β_2 = 1006/60 s for phone-based VR.

In this section we introduce our model for VR viewport position. We focus on the viewport position change over a time interval ∆t, in order to analyze the ViS of VR frames that are ∆t apart. Adopting the axis notation common in computer graphics [14], [37], the viewport positions in the Earth coordinates XYZ (shown in Fig. 2) are denoted as (x, y, z), where y is the height of the viewport, and x and z are the coordinates of the viewport position in the XZ-plane (i.e., the ground plane). We assume that y is constant, e.g., y can be set to the human eye level. While changing y is important in some specific contexts, such as exergames [38], [39], y is fixed in the vast majority of typical VR experiences, and in native Oculus Integration app development, to avoid disorienting the users when they sink below or float above the ground in the virtual environment [40].

To model the change of x and z, we propose a paused-MRWP position model in the infinite plane, based on our collected pose data and the modified RWP [22]. The model consists of an infinite sequence of points W_n, n ∈ N+, called waypoints, the pause time S_n at each waypoint, the duration T_n to move along a straight line from W_n to W_{n+1} with a constant velocity v, and the angle α_n between the vector from W_n to W_{n+1} and the abscissa. The waypoint is expressed as W_n = (x_n, z_n), where x_n and z_n are the coordinates of the waypoint in the XZ-plane. The vector from W_n to W_{n+1} is called the n-th flight, and α_n is called the direction of the n-th flight. At time 0, the viewport is at W_1 and starts to move towards W_2. Due to the constant velocity, the flight time T_n for the n-th flight is proportional to its length, T_n = ‖W_{n+1} − W_n‖ / v. The reason for assuming a constant velocity is that, although acceleration can be used to produce more realistic movements, constant-velocity VR movement is known to be more comfortable than movement with acceleration or deceleration [41]. We further assume that T_n, S_n, and α_n are all i.i.d. over n. Based on the experimental data in desktop VR, we will propose models for T_n, S_n, and α_n.
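A minimal simulation of the paused-MRWP model can be sketched as follows, under assumptions consistent with the fits reported below (exponential flight and pause times, a point mass of pause times at zero, uniform flight directions, constant speed v); the parameter values are placeholders rather than fitted values from the dataset.

```python
import numpy as np

def simulate_paused_mrwp(duration, v=1.4, mu=0.5, lam=0.4, c=0.2, seed=0):
    """Sample a paused-MRWP trajectory: alternating straight-line flights and pauses.

    Assumed parameterization: flight times ~ Exp(mu), pause times ~ Exp(lam) with a
    point mass c at zero, flight directions ~ Uniform[0, 2*pi), constant speed v.
    Returns a list of (time, x, z) samples at flight/pause boundaries.
    """
    rng = np.random.default_rng(seed)
    t, x, z = 0.0, 0.0, 0.0
    samples = [(t, x, z)]
    while t < duration:
        # straight-line flight with a uniformly random direction
        T = rng.exponential(1.0 / mu)
        alpha = rng.uniform(0.0, 2.0 * np.pi)
        x += v * T * np.cos(alpha)
        z += v * T * np.sin(alpha)
        t += T
        samples.append((t, x, z))
        # pause at the new waypoint (skipped with probability c)
        s = 0.0 if rng.random() < c else rng.exponential(1.0 / lam)
        t += s
        samples.append((t, x, z))
    return samples

trajectory = simulate_paused_mrwp(duration=120.0)
```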
The collected data shows that the viewport movement in the XZ-plane is well approximated by a sequence of flights. We apply the standard angle model proposed in [19] to extract flights from the trajectories. Fig. 5 plots the trajectory in the XZ-plane of one user in Lite and the extracted flights. Although the viewport does not move in a perfectly straight line during each flight, the trajectory is close to it.

Modeling flight duration. Our experimental data demonstrates that T_n is exponentially distributed, and confirms that the paused-MRWP models the flight time in VR better than the classical RWP models [17]. Specifically, the PDF of the flight times T_n, denoted f_{T_n}(t), is modeled as µe^(−µt). Fig. 6 shows the cumulative distribution function (CDF) of flight times for our collected flight samples, the paused-MRWP model with the best-fitted µ, and the classical RWP with a constant velocity, in the VK game. We see that the flight times of the paused-MRWP model match the measurements better statistically, with an SSE as low as 0.0021. The exponential distribution with the best-fitted µ will be used to model the flight times in §IV and §V.

Modeling pause time. We model the pause time S_n according to the collected data. Fig. 7 shows that an exponential distribution with a "bump" around zero is a good fit to its distribution. The PDF of the pause times, denoted f_{S_n}(s), is modeled as (1 − c)λe^(−λs) + cδ(s), where λe^(−λs) is the PDF of the exponential distribution with parameter λ, the Dirac delta function δ(s) models the "bump" around zero pause time, and c represents the fraction of the "bump".

Modeling flight direction. In our pose dataset, the flight angles α_n follow the uniform distribution on [0, 2π).

With the developed paused-MRWP model, we will focus on observation intervals of duration ∆t and derive an analytical expression for the moment generating function (MGF) of the displacement between two viewport positions. Let X_ref and X_nov denote the viewport positions at the observation start time t_s and at t_s + ∆t, let ψ denote the displacement between them, and let M_ψ(τ) denote its MGF. We note that during a small ∆t in VR systems, X_ref and X_nov are not necessarily at the waypoints. This is in contrast to conventional applications of RWP models, where the movement is observed at a larger timescale [17], [19].

2) Analysis: In analyzing M_ψ(τ), there are two mutually exclusive and exhaustive cases to consider. Setting T_0 = S_0 = 0 as auxiliary variables, Case 1 is the case in which the observation start time t_s falls within a pause interval, i.e., we start observing the process while the movement is paused. Let Λ denote the event that Case 1 holds, and let Λ′ denote the complement of Λ, i.e., the event that Case 2 holds. Case 2 is the case in which t_s falls within a flight interval, i.e., we start observing the process during a flight. We will first obtain the k-th (k ≥ 1) moments of ψ, m_Λ(k) and m_Λ′(k), for Cases 1 and 2 in Lemmas 1 and 2. Combined with the probability that Case 1 holds, given in Lemma 3, we will obtain M_ψ(τ) in Theorem 1.

Lemma 1. Assume Case 1 holds. Let T′_j = 0, and let S′_j be the remaining pause duration after t_s in the same pause interval. We define the events A_n (the observation end time t_s + ∆t lies in a pause interval, with n − 1 complete flights in [t_s, t_s + ∆t]) and B_n (t_s + ∆t lies in a flight interval, with n − 1 complete flights in [t_s, t_s + ∆t]), and express m_Λ(k) as a sum over these events in (1).

Proof sketch: To calculate E[ψ^k 1(A_n)], we express the k-th moment of ψ in terms of v and the unit vectors e_i, where e_i denotes the unit vector whose direction is the moving direction α_i of the i-th flight. Based on the property that the α_i are i.i.d. uniformly distributed on [0, 2π) and on the distributions of the flight and pause times, we obtain the expressions of E[ψ^k 1(A_n)]. Similar techniques are used to obtain E[ψ^k 1(B_n)].

Lemma 2. Assume Case 2 holds.
Let S′_{j−1} = 0, and let T′_j be the remaining flight duration after t_s in the same flight interval. Let T′_i = T_i for i > j and S′_i = S_i for i ≥ j. Let the events A_n and B_n be defined analogously to Lemma 1.

Proof. The proof is similar to that of Lemma 1.

Lemmas 1 and 2 yield the expressions for the k-th moment of the position displacement for arbitrary ∆t. In VR systems, we analyze the ViS for small ∆t. In this case, the number of flights and pauses in the observation interval is limited, and the terms E[ψ^k 1(A_n)] and E[ψ^k 1(B_n)] with n ≤ N dominate m_Λ(k) and m_Λ′(k). N = 2 is a good choice for N because the sum of the terms E[ψ^k 1(A_n)] (or E[ψ^k 1(B_n)]) with n ≤ 2 accounts for more than 98% of m_Λ(k) (or m_Λ′(k)) when ∆t < 1 s and k ≤ 4. Hence, we can simplify the calculation of m_Λ(k) and m_Λ′(k) by discarding the terms corresponding to the cases of n > 2 when ∆t is small (e.g., ∆t < 1 s). Lemma 3 gives the probability that Case 1 holds, and Theorem 1 combines Lemmas 1-3 to yield M_ψ(τ).

Proof. The proof follows directly from Lemmas 1-3.

We validate the analysis by comparing the analytical moments m(k) with empirical estimates m̂(k), obtained by randomly sampling 5000 pairs of viewport positions that are ∆t apart and calculating the position displacement. The gap |m̂(k) − m(k)| is smaller than 5 × 10⁻⁶ when ∆t = 1/6 s and k = 4, and is smaller than 6% of m̂(k) in the other cases. Theorem 1 will be used to calculate the ViS in §IV.

VR viewports' position and orientation are correlated. Observing the azimuth angles φ and the walking directions (i.e., the angle between the vector from X_ref to X_nov and the positive direction of the X-axis) in our collected data, we find that the azimuth angles fixate around the walking direction. Similar observations have been made about human walking patterns in non-virtual worlds [42], [43]. Supported by the pose data, we assume that the azimuth angle at the observation start time t_s is the same as the walking direction.

We introduce the model for the average visibility similarity given the viewport-to-content distance d, ViS(d), in §IV-A. We then apply the developed VR pose model to analyze ViS(d) in §IV-B, and propose the ViS-based VR content splitting algorithm, ALG-ViS, in §IV-C.

The analytical model for ViS(d) we develop in this section characterizes the statistical average of the inter-frame pixel similarity over different pose changes, given the viewport-to-content distance d. We define ViS(d) formally after introducing the camera model, which describes how the viewport pose determines the rendered pixels.

Camera model. In VR, the virtual environment is constructed from computer-generated 3D contents, and the pixels in VR frames are generated by capturing the scene with a camera. The camera is modeled as a standard pinhole camera following [44]. A 3D point in the virtual environment is projected through the pinhole to a pixel in the VR frame. We denote the camera's angle of view (AoV) as w_fv. In addition, we assume that the far plane d_fp of the camera, i.e., the largest viewport-to-content distance beyond which the contents cannot be rendered in the VR frame, is much larger than the viewport position change.

Consider two VR frames generated at t_s and t_s + ∆t, called the reference and novel frames. The cameras that capture these frames are called the reference and novel cameras, respectively. A pixel in the reference frame is projected back to a 3D point in the virtual environment, and the 3D point is projected to the corresponding pixel in the novel frame. The viewport-to-content distance d is formally defined as the distance between the viewport position of the reference frame and the 3D point.
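The reprojection step that underlies reusing reference-frame pixels can be sketched with a standard pinhole model as below; the intrinsics K and the world-to-camera rotations/translations are assumed inputs, and lens distortion, occlusion, and the engine's specific projection conventions are ignored.

```python
import numpy as np

def reproject(px_ref, depth, K, R_ref, t_ref, R_nov, t_nov):
    """Map a reference-frame pixel (u, v) with depth d to its novel-frame pixel.

    K: 3x3 pinhole intrinsics; (R, t): world-to-camera rotation and translation of
    the reference and novel cameras. Returns the (u', v') coordinates in the novel frame.
    """
    u, v = px_ref
    # Back-project the pixel to a 3D point in the reference camera frame ...
    p_cam_ref = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    # ... lift it to world coordinates ...
    p_world = R_ref.T @ (p_cam_ref - t_ref)
    # ... and project it into the novel camera.
    p_cam_nov = R_nov @ p_world + t_nov
    uvw = K @ p_cam_nov
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

# Toy usage: identity rotations, the novel camera shifted 0.1 m along X.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
I = np.eye(3)
print(reproject((320, 240), 5.0, K, I, np.zeros(3), I, np.array([0.1, 0.0, 0.0])))
```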
The ViS(d) represents the average similarity of the pixels (of distance d) in the reference frame and their corresponding pixels in the novel frame, where the average is taken over different pose changes of the reference and novel frames.

Definition 4 (ViS(d)). Let S_d denote the number of the reference frame's pixels that are projected to 3D points at distance d from the reference camera. ViS(d) is the statistical average, over the pose changes between the reference and novel frames, of the fraction of these S_d pixels whose corresponding pixels in the novel frame represent the same 3D points.

We break down ViS(d) into two terms: (1) the FoV term ViS_fov, which represents the fraction of the VR contents contained in the FoVs of both the novel and the reference cameras, and (2) the distance term ViS_dst(d), which quantifies the ratio of the number of pixels representing the same 3D points (of distance d) in the novel and reference frames. In VR systems, the viewport moving closer to the VR contents results in the use of more pixels to represent the contents. Note that ViS_fov is independent of d, while ViS_dst(d) is a function of d. In our analysis, we ignore the influence of occlusion on ViS(d), i.e., we do not consider the case where occluded objects are in the FoV of both the reference and novel frames, are rendered in the novel frame, but are occluded by other contents in the reference frame. In our numerical results, we show that this effect is small. We have ViS(d) = ViS_fov · ViS_dst(d). The terms ViS_fov and ViS_dst(d) are defined formally below.

Definition 5 (FoV term). ViS_fov is defined as the statistical average of the product of the fractions of overlapping azimuth and polar angles of the reference and novel cameras. The fraction of overlapping azimuth angles follows from the AoV w_fv and the azimuth angle change ∆φ (see Fig. 9); the fraction of overlapping polar angles is obtained similarly.

Definition 6 (Distance term). ViS_dst(d) is defined as the statistical average of the ratio of the number of pixels representing the same 3D point in the novel and reference frames, expressed in terms of the distance between the VR content and the novel camera and the included angle ϑ shown in Fig. 9.

We first provide a closed-form expression for ViS_fov, and then approximate ViS_dst(d) tightly with a small error.

Theorem 2 (FoV term). ViS_fov admits the closed-form expression given in (4), where b_{l,θ} is the scale of the fitted Laplace distribution of ∆θ and p_{φ_f} is the fraction of overlapping azimuth angles.

Proof. See Appendix C.

We focus on small ∆t (e.g., ∆t ≤ 1 s, which belongs to the first case in (4)), as the most relevant, in practice, to exploiting VR frame correlation. In this case, the FoV term only depends on w_fv, b_l, and b_{l,θ}.

Theorem 3 (Distance term). For any ε > 0, ViS_dst(d) can be approximated according to (5), with an approximation error that can be made arbitrarily small by choosing a small ε.

Proof. See Appendix D. Proof sketch: Taking the average over ϑ and substituting m(1), (3) is rewritten as a convergent series; the sum of the first 30 terms of the series provides a good approximation, with an error smaller than 0.01 in §V. From (5), ViS_dst(d) can be approximated in terms of d and 1 + m(1), which concludes the proof.

Algorithm 1 ALG-ViS
1: count ← 0;
2: for each VR frame do
3:   if count mod R = 0 then
4:     The frame is selected as a reference frame;
5:     Classify all VR contents as foreground contents;
6:   else
7:     d_tr ← d_fp;
8:     for (d = 0.5; d < d_fp; d ← d + 0.5) do
9:       Calculate ViS(d);
10:      if ViS(d) ≥ ViS_tr then
11:        d_tr ← d;
12:        break;
13:    Classify the VR contents with d < d_tr as foreground contents, and the other contents as background contents;
14:  count ← count + 1;
15:  Render the foreground contents. Reuse the background pixels from the reference frame by view projection;

Based on the average ViS for a given d, i.e., ViS(d), we adaptively split the contents into background and foreground, where the background (with a larger d) has a high ViS(d) and can be reused to reduce the resource consumption.
To this end, we propose ALG-ViS, given in Algorithm 1. For every R consecutive frames, the first frame is selected as the reference frame; the remaining R − 1 frames are novel frames. In reference frames, all VR contents are classified as foreground contents. In novel frames, we calculate a distance threshold d_tr such that ViS(d_tr) = ViS_tr, where ViS_tr is a threshold such that a ViS(d) larger than ViS_tr indicates high similarity of pixels in the reference and novel frames. The VR contents at distance d < d_tr from the reference camera are classified as foreground contents, and the other contents as background contents. The VR system renders the foreground contents and reuses the pixels of the background contents from the reference frame by view projection [8], [13]. We only need to calculate ViS(d) for R − 1 values of ∆t when the inter-frame interval is fixed (e.g., when the system supports the full frame rate as in §V), which makes ALG-ViS even more lightweight.

We verify the analysis of the ViS via simulations in §V-A and examine the performance of ALG-ViS in real-world VR implementations in §V-B. The parameters are listed in Table V unless otherwise specified.

A. Examining ViS

1) Simulation settings: We verify the ViS analysis via simulations using Unity Engine 2019.2.14f1 [14] with the 3 VR games listed in Table I. The results are analyzed in MATLAB. To simulate the ViS, we randomly sample 5000 pairs of viewport poses from the collected pose trajectories for the reference and novel cameras. We obtain the pristine novel frame rendered by Unity, and the generated novel frame by view projection from the reference frame and its depth map, where each pixel in the depth map represents the distance of the VR contents to the reference camera. Among the generated novel frame's pixels whose corresponding 3D points are at distance d from the reference camera, the pixels with RGB values that are indistinguishable from the pristine novel frame constitute the set Φ(d); the other pixels form the set Φᶜ(d). The ViS given d is calculated as the proportion of matched pixels, |Φ(d)| / (|Φ(d)| + |Φᶜ(d)|). We consider pixel values indistinguishable when the difference of the pixel values is less than C_th ∈ N+ in all RGB channels.²

2) Numerical results: Fig. 10 shows the ViS obtained in our simulations and the analytically derived ViS(d), for headset VR and phone-based VR, for different d and ∆t. The results for desktop VR are similar to the results for phone-based VR and are omitted. As expected, the ViS declines with ∆t (i.e., frames that are separated by a longer time interval are less similar) and increases with d (i.e., contents that are farther from the camera change less across different frames). We observe differences between VR interfaces as well: the ViS for headset VR (Fig. 10(a)) is smaller than the ViS for the other interface types (Fig. 10(b)). This is explained by the difference in movement dynamics we observed in our dataset: in headset VR, the users change their viewport orientations and flight directions more rapidly than in the other VR interfaces. A smaller ViS for headset VR implies that fewer VR contents will be classified as background contents. Finally, we note that the gap between the analytical results and the simulations is small: 4.1% on average for d < 20 m and only 1.6% on average for d ≥ 20 m. The gap can be attributed to the omission of object occlusions from our analytical derivations.
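For reference, the matched-pixel ViS computation used in the simulations of §V-A can be sketched as follows; the array layouts and the 0.5 m depth binning are assumptions, and Φ(d)/Φᶜ(d) are formed per depth bin using the per-channel C_th test.

```python
import numpy as np

def vis_by_distance(pristine, generated, depth_ref, c_th=3, bin_width=0.5):
    """Empirical ViS(d): per depth bin, the fraction of reprojected pixels whose RGB
    values differ from the pristine novel frame by less than c_th in every channel.

    pristine, generated: HxWx3 uint8 frames; depth_ref: HxW distances (m) of each
    generated pixel's 3D point from the reference camera (NaN where no pixel was reused).
    """
    valid = ~np.isnan(depth_ref)
    match = np.all(np.abs(pristine.astype(int) - generated.astype(int)) < c_th, axis=-1)
    edges = np.arange(0.0, np.nanmax(depth_ref) + bin_width, bin_width)
    idx = np.digitize(depth_ref, edges)
    vis = {}
    for b in np.unique(idx[valid]):
        sel = valid & (idx == b)            # pixels whose distance falls in this bin
        vis[edges[b - 1]] = match[sel].mean()  # |Phi(d)| / (|Phi(d)| + |Phi_c(d)|)
    return vis
```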
The achieved high accuracy of the analytical ViS(d) in the high-ViS regime (e.g., d ≥ 20 m) is crucial for selecting the distance threshold d_tr so as to ensure that the background has a high ViS.

The obtained ViS for the three VR games in desktop VR is shown in Fig. 11. The results for the other 2 interface types exhibit the same trends and are omitted. Although VK has a larger number of triangles and vertices, which manifests in higher visual scene complexity, the ViS(d) of VK is larger than the ViS(d) of Office and Lite. This is because users' pose trajectories in VK have relatively smaller b_{l,θ} and b_l, and smaller µ and λ, corresponding to larger flight lengths and pause durations. In other words, users tend to change both their viewport orientations and positions more slowly in VK. We hypothesize that higher-complexity games may potentially be more engaging, which encourages the users to explore them in a slower, more deliberate fashion. The observed differences between the ViS for different games suggest that it is important to take a specific game's pose characteristics into account.

B. Real-world VR implementations

We implement ALG-ViS and a set of baselines in two VR rendering systems, an on-device ("local") one and one supported by edge computing. Both systems display the generated frames in an Oculus Quest 2 VR headset [45]. The on-device rendering system is implemented with build 30.0; its target frame rate is set to the default 72 fps. The edge-assisted system generates the VR frames on a Lenovo laptop, with Unity Engine 2019.1.14f1 and the Google VR SDK [46], and sends them to the headset over IEEE 802.11ac WiFi. The laptop is equipped with an AMD Ryzen 7 4800H CPU and an NVIDIA GTX 1660 Ti GPU. The target frame rate of this system is set to the laptop's default 60 fps. To ensure reproducibility, for all algorithms our evaluation is based on replaying 30 min of headset pose trajectories we collected (see §III-A). We examine the performance for all 3 games, and present the results for Lite, which are representative. We examine the required bandwidth for the edge-supported system in Fig. 12, and the achieved SSIM, frame processing time, and CPU and GPU usage (monitored using the OVR Metrics Tool [47]) for the local system in Fig. 13.

We compare ALG-ViS to different approaches with the same reference frame interval R: ALG-FX-S, ALG-FX-T, and ALG-ML. In ALG-ViS, ViS_tr is set to 0.945 to ensure high ViS values for the background contents. We set R as the minimum integer such that d_tr < d_fp for every frame. In ALG-FX-S and ALG-FX-T, d_tr is determined by the number of triangles N_tr in the VR frame, similar to the near and far background splitting in [10]. Specifically, d_tr = max{qN_tr, d_fp}, where q is fixed for each game. For a fair comparison, in ALG-FX-T we set q to make the average frame processing time of ALG-FX-T and ALG-ViS the same when comparing the SSIM; in ALG-FX-S, we set q to make the average SSIM the same as ALG-ViS when comparing the other metrics. In ALG-ML, we adopt an online ridge regression model, which has been shown to achieve state-of-the-art accuracy in 360° video pose prediction [24], [25]. Following [11], we predict x, y, z, θ, and φ separately. We set the history and prediction windows as in [11]. We split the VR contents by calculating the ViS of the reference frame and the predicted VR frame.
The CDF of the required bandwidth of the edge-assisted rendering system, shown in Fig. 12, demonstrates that ALG-ViS requires less bandwidth on average than ALG-FX-S (an 11.3% difference; ALG-FX-S consistently transmits more pixels to maintain the same SSIM as ALG-ViS) and has a significantly smaller bandwidth variance than ALG-ML (an 88.4% reduction). Although ALG-ML can save bandwidth when it accurately predicts the viewport pose, the required bandwidth increases drastically when the prediction is erroneous. By generating less bursty traffic, ALG-ViS prevents transmission resource overprovisioning and potential TCP incast problems in edge-assisted VR systems, which helps support these systems better than ALG-ML.

Fig. 13 shows the CDFs of the SSIM, frame processing time, and CPU and GPU usage for the local rendering system. It shows that ALG-ViS improves frame quality and frame processing time while consuming fewer resources. ALG-ViS ensures a high average SSIM, outperforming ALG-FX-T by 3.2% and ALG-ML by 5.9%. ALG-ML exhibits lower frame quality because it splits the foreground and background contents based on the predicted pose, and prediction errors lead to severe performance degradation [8]. ALG-ViS decreases the frame processing time by 16.1% and 33.4% compared to ALG-FX-S and ALG-ML, respectively. The CPU usage of ALG-ML is 20.5% higher than that of ALG-ViS due to the extra computation required to tune the regularization parameter and conduct the pose prediction. The GPU usage of ALG-FX-S is 33.7% higher than that of ALG-ViS because ALG-FX-S classifies more VR contents as foreground contents on average. These results demonstrate that the developed ALG-ViS is lightweight yet effective.

VI. CONCLUSION

In this paper, we first propose a viewport pose model for VR systems based on experimental measurements. We apply the pose model to adaptively select background contents that are reused across VR frames to reduce the communication and computation resource consumption, via quantifying the similarity of pixels across VR frames. Numerical results verify the pose model and the inter-frame pixel similarity analysis. Oculus Quest 2-based implementations of our adaptive background content selection approach show that it improves the image quality by 5.6% and reduces the variance of the required bandwidth by 88.4% compared to the method based on viewport pose prediction.

ACKNOWLEDGMENT

This work is supported in part by NSF grants CSR-1903136, CNS-1908051, and CAREER-2046072, and by an IBM Faculty Award.

REFERENCES

Virtual Reality
Technical specification group services and system aspects
Seeing is believing
Creating the perfect illusion: What will it take to create life-like virtual reality headsets
Supporting mobile VR in LTE networks: How close are we
Towards viewport-dependent 6DoF 360 video tiled streaming for virtual reality systems
Cutting the cord: Designing a high-quality untethered VR system with low latency remote rendering
DeltaVR: Achieving high-performance mobile VR dynamics through pixel reuse
MUVR: Supporting multi-user mobile virtual reality with resource constrained edge cloud
Coterie: Exploiting frame similarity to enable high-quality multiplayer VR on commodity mobile devices
Firefly: Untethered multi-user VR for commodity mobile devices
Desktop VR is better than non-ambulatory HMD VR for spatial learning
Rendering-aware VR video caching over multi-cell MEC networks
Unity Technologies. (2021) The leading platform for creating interactive, real-time content
Furion: Engineering high-quality immersive virtual reality on today's mobile devices
Predictive scheduling for virtual reality
The node distribution of the random waypoint mobility model for wireless ad hoc networks
A brownian motion model for last encounter routing
On the Levy-walk nature of human mobility
Impact of random receiver orientation on visible light communications channel
Modeling the random orientation of mobile devices: Measurement, analysis and LiFi use case
Towards understanding the fundamentals of mobility in cellular networks
Saliency in VR: How do people explore virtual environments?
Flare: Practical viewport-adaptive 360-degree video streaming for mobile devices
Viewing the 360° future: Trade-off between user field-of-view prediction, network bandwidth, and delay
Motion prediction and pre-rendering at the edge to enable ultra-low latency mobile 6DoF experiences
Predictive adaptive streaming to enable mobile 360-degree and VR experiences
LiveObj: Object semantics-based viewport prediction for live mobile virtual reality streaming
A dataset of head and eye movements for 360° videos
360° video viewing dataset in head-mounted virtual reality
6DOF virtual reality dataset and performance evaluation of millimeter wave vs. free-space-optical indoor communications systems for lifelike mobile VR streaming
ViVo: Visibility-aware mobile volumetric video streaming
Unity Technologies
The Cambridge dictionary of statistics
Rendering. Terathon Software LLC
Supernatural. (2021) Burn More. Sweat More. Have More Fun
(2021) A new way to exercise
Oculus Integration for Unreal Engine basics
Rotational and translational velocity and acceleration thresholds for the onset of cybersickness in virtual reality
Direction of gaze while walking a simple route: Persons with normal vision and persons with retinitis pigmentosa
"Look where you're going!": Gaze behaviour associated with maintaining and changing the direction of locomotion
A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses
Oculus Quest 2
Quickstart for Google VR SDK for Unity with Android
APPENDIX

Proof of Lemma 1. When Case 1 holds, there are two possible ways to end the observation interval at time t_s + ∆t: the observation end time t_s + ∆t lies either in a pause interval or in a flight interval. If it lies in a pause interval, the movement includes some number of complete flights without any fractional flights. We do not observe any movement when n = 1. For n > 1, A_n is the event that t_s + ∆t lies in a pause interval and there are n − 1 complete flights in [t_s, t_s + ∆t]. This leads to the first summation in (1). If t_s + ∆t lies in a flight interval, the movement includes some number of complete flights and a fraction of the last flight. B_n is the event that t_s + ∆t lies in a flight interval and there are n − 1 complete flights in [t_s, t_s + ∆t]. This leads to the second summation in (1). It only remains to calculate E[ψ^k 1(A_n)] and E[ψ^k 1(B_n)] to conclude the proof. We start with E[ψ^k 1(A_n)]. Given the T_i and e_i, the k-th moment of ψ is expressed in terms of v and the flight vectors T_i e_i. According to the flight direction model in §III-C, the directions of the e_i are i.i.d. uniformly distributed on [0, 2π). Thus, E_{e_i, e_k}[e_i · e_k] is equal to 1 for i = k and 0 for i ≠ k.
Thus, expanding and averaging over e_{j+1}, e_{j+2}, ..., e_{j+n−1}, only the terms in which all T_i e_i (j + 1 ≤ i ≤ j + n − 1) have even powers are non-zero. The number of these non-zero terms is the binomial coefficient C(n + k − 2, n − 2). Further, let h (h ≤ n − 1) denote the number of non-zero complete pause intervals (excluding the first and last fractional pause intervals), and consider the sum of the pause intervals in the observation interval. E[ψ^k 1(A_n)] is then given by (6). Using these observations, the expressions for E[ψ^k 1(B_n)] can be obtained similarly, which completes the proof.

Proof of Lemma 3. Let Γ_n be the end time of the n-th flight, i.e., Γ_n = Σ_{i=0}^{n−1} S_i + Σ_{i=0}^{n} T_i. Then, for any T > 0, there exists n ∈ N such that Γ_n ≤ T < Γ_{n+1}, and we can upper- and lower-bound p_T accordingly. The lower bound includes an extra pause interval between Γ_n and Γ_{n+1} by ignoring any possible fractional flight duration; the upper bound includes one complete flight duration between Γ_n and Γ_{n+1} by ignoring S_n. When T → ∞, using Lebesgue's dominated convergence theorem to exchange the order of the limit and expectation operators, and the law of large numbers (as T grows large, n tends to infinity), both the upper and lower bounds converge to p = (λ/(1 − c)) / (λ/(1 − c) + µ).

Proof of Theorem 2. As seen from Fig. 9, the overlapping azimuth angle is expressed in terms of the AoV w_fv and p_∆φ(∆φ), the PDF of ∆φ obtained in §III-B. Substituting p_∆φ(∆φ) into (7), we obtain the result for p_{φ_f} (the fraction of overlapping azimuth angles) in (4). Similarly, we obtain the fraction of overlapping polar angles using p_∆θ(∆θ), the PDF of ∆θ obtained in §III-B. Combining the fractions of overlapping azimuth and polar angles in (4) and (8) yields the FoV term. This is because the walking direction is the same as φ_ref (see §III-D), and the included angle between the vector from X_ref to X_3D and the positive direction of the X-axis is uniformly distributed on φ_ref −