Comparing the Quality of Highly Realistic Digital Humans in 3DoF and 6DoF: A Volumetric Video Case Study

Shishir Subramanyam*, Jie Li, Irene Viola, Pablo Cesar
CWI, Amsterdam, The Netherlands

Figure 1: Users evaluating realistic digital humans in 6DoF (left) and 3DoF (right)

ABSTRACT

Virtual Reality (VR) and Augmented Reality (AR) applications have seen a drastic increase in commercial popularity. Different representations have been used to create 3D reconstructions for AR and VR. Point clouds are one such representation, characterized by their simplicity and versatility, which makes them suitable for real-time applications such as reconstructing humans for social virtual reality. In this study, we evaluate how the visual quality of digital humans, represented using point clouds, is affected by compression distortions. We compare the performance of the upcoming point cloud compression standard against an octree-based anchor codec. Two VR viewing conditions, enabling 3 and 6 degrees of freedom, are tested to understand how interacting in the virtual space affects the perception of quality. To the best of our knowledge, this is the first work performing user quality evaluation of dynamic point clouds in VR; further contributions of the paper include quantitative data and empirical findings. Results highlight how perceived visual quality is affected by the tested content, and how current data sets might not be sufficient to comprehensively evaluate compression solutions. Moreover, shortcomings in how point cloud encoding solutions handle visually lossless compression are discussed.

Index Terms: Human-centered computing—Human computer interaction (HCI)—HCI design and evaluation methods—User studies; —Interaction paradigms—Virtual reality

1 INTRODUCTION

Recent advances in capturing, media processing, and 3D rendering technologies make VR/AR applications popular for mass consumption [34]. In this new media landscape, point clouds are becoming commonplace due to their simplicity and versatility. Still, the size of dense point clouds is significant (a frame of roughly 1M points takes around 19-20 MBytes), which calls for compression before transmission. This paper provides an exhaustive quality comparison between different encoding configurations of digital humans represented as point clouds. By investigating the differences in quality, we provide insights about how to optimise the delivery for both downloading and real-time communication. One key novelty of this paper is to study the quality under realistic consumption conditions, in 3- and 6-Degrees of Freedom (DoF) scenarios.

*e-mail: {S.Subramanyam, Jie Li, Irene.Viola, P.S.Cesar}@cwi.nl

Avatars are a core part of VR applications like social communication [28], sports training [21], or healthcare [20]. A major line of scientific work has focused on how to make such avatars more realistic, interactive, and autonomous [10, 24, 33]. In this paper, we focus instead on point clouds as a suitable representation for digital humans based on teleportation principles [25]. In this case, the research problem is not so much how to render and animate them to make them look more realistic, but how to transport them optimally.
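To put the frame size quoted above in perspective, the back-of-the-envelope sketch below (our own illustration; the 15-byte point layout is an assumption, while the 19-20 MByte figure comes from the text) shows why uncompressed delivery at 30 frames per second is not realistic:

```python
# Rough estimate of raw (uncompressed) point cloud bandwidth.
# Assumed layout (not stated in the paper): x, y, z as 32-bit values plus
# 8-bit R, G, B per point; the paper itself quotes 19-20 MB per ~1M-point frame.
POINTS_PER_FRAME = 1_000_000
BYTES_PER_POINT = 3 * 4 + 3 * 1      # 12 B geometry + 3 B color = 15 B
FPS = 30

frame_mb = POINTS_PER_FRAME * BYTES_PER_POINT / 1e6
raw_gbps = POINTS_PER_FRAME * BYTES_PER_POINT * 8 * FPS / 1e9

print(f"~{frame_mb:.0f} MB per frame, ~{raw_gbps:.1f} Gbps raw at {FPS} fps")
# With the paper's 19-20 MB/frame figure, the raw rate is roughly 4.6-4.8 Gbps,
# far beyond typical consumer bandwidth, hence the need for compression.
```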
Given current advances in technology, real-time delivery of point clouds is becoming a realistic alternative, drawing the attention of the research community [23] and industry [32] to encoding and transmission. Still, given the massive number of points per representation, decisions need to be taken regarding the delivery (type of encoder, bit-rate) to ensure an acceptable quality of experience depending on the viewing conditions (3DoF, 6DoF). This is the core research question this paper answers.

Contributions of the paper are two-fold: 1) it provides a first evaluation of the quality of highly realistic digital humans represented as dynamic point clouds in immersive viewing conditions. Existing protocols [5, 7, 8, 40, 42] did not consider the dynamic nature of point clouds, focused on one type of data set, and did not take into account VR viewing conditions; 2) it provides quantitative subjective results about the perceived quality of the contents, along with qualitative insights on what is important for users when interacting with digital humans in VR. Such results will help in better configuring the network conditions for the delivery of point clouds for real-time transmission, and have implications for ongoing research and standardisation work regarding the underlying compression technology.

In particular, this paper extensively studies this current and relevant area of research by proposing 1) a new evaluation protocol, including the work to create dynamic point clouds for evaluation, and 2) quality of experience results. These results are based on an experiment with 52 participants, evaluating 72 stimuli based on eight dynamic point cloud sequences. Each point cloud sequence was compressed at four bit-rates, using two types of compression techniques. These 72 stimuli were evaluated in two viewing conditions (3DoF and 6DoF). The data gathered include rating scores, presence questionnaires, simulator sickness reports, and time spent watching the content. The results indicate that, while bit-rate savings can be obtained by choosing one compression solution over another, visually lossless compression has not been fully achieved by the algorithms under evaluation, even at rather large bit-rates. Moreover, the choice of content can have an impact on how users rate its quality, influencing the discriminating power of the selected protocol.

Figure 2: Point cloud digital humans compressed using two point cloud codecs, (a) V-PCC and (b) MPEG anchor, at the 4 selected bit-rates.

2 RELATED WORK

2.1 Quality assessment for point clouds

Capturing and displaying volumetric videos is becoming feasible [2, 30]. Point clouds are frequently used as a data format for volumetric video in augmented reality (AR) and virtual reality (VR) applications. Point clouds collate a large number of geo-referenced points to represent humans or objects in 3D; color information can be provided with each point [40]. To visualize 3D content at sufficient fidelity, the number of points must be high, which results in large data volumes and makes the point clouds difficult to store and transmit. To support low-latency transmission in AR/VR applications within a limited bandwidth, compression is necessary. However, it remains challenging to measure and predict the acceptable quality of compressed point clouds.
There is a growing interest in the subjective quality assessment of point clouds rendered on 2D displays. Zhang et al. [42] evaluated the quality degradation effect of resolution, shape and color on static point clouds. The results indicate that resolution is almost linearly correlated with the perceived quality, and that color has less impact than shape on the perceived quality. Zerman et al. [40] compressed two dynamic human point clouds using a state-of-the-art algorithm [22], and assessed the effects of this algorithm and of the input point counts on the perceived quality. Their results showed no direct correlation between human viewers' quality ratings and input point counts. In a recent study [11], a protocol to conduct subjective quality evaluations and to benchmark objective quality metrics was proposed. The viewers passively assessed the quality of a set of static point clouds, shown as animations with pre-defined movement paths. In a comprehensive work by Alexiou et al. [8], the entire set of emerging point cloud compression encoders developed in the MPEG committee was evaluated through a series of subjective quality assessment experiments. Nine static models, including both humans and objects, were used in the experiments. The experiments provided insights regarding the performance of the encoders and the types of degradation they introduce.

Only a limited number of point cloud quality assessment studies have been conducted in immersive environments. Mekuria et al. [23] evaluated the subjective quality of their codec in a realistic 3D tele-immersive system, in which users were represented as 3D avatars and/or 3D dynamic point clouds, and could navigate in the virtual space using a mouse cursor in a desktop setting. Several aspects of quality, such as the level of immersiveness, togetherness, realism, and quality of motion, were considered. Alexiou and Ebrahimi [7] proposed the use of AR to subjectively evaluate the quality of colorless point cloud geometry. Tran et al. [37] suggested that, when evaluating video quality in an immersive setup, aspects such as cybersickness and presence should not be overlooked.

For the objective evaluation of point clouds, there are two main approaches. Depending on the availability of point location and color information, either point-based or projection-based metrics can be used [36]. Current point-based approaches can assess either geometry-only or color-only distortions. For geometry errors, three metrics are commonly used in studies [3, 6, 7, 35, 36], namely the point-to-point, point-to-plane and plane-to-plane metrics. These metrics are computed using the root mean square (RMS) distance, the mean square error (MSE) or the Hausdorff distance. Moreover, the geometric Peak Signal-to-Noise Ratio (PSNR) is used for the point-to-point and point-to-plane metrics [35]. For color distortions, the total color degradation value is based either on the color MSE or the PSNR, computed in either the RGB or the YCbCr color space [8]. The projection-based approaches map the rendered models onto planar surfaces, and conventional 2D imaging metrics are employed [12]. A comprehensive study [8] showed that the performance of the current objective metrics is not ideal, revealing the need for better solutions. Therefore, in this study, we do not include any of the above-mentioned objective metrics, and focus on subjective quality assessment.
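For readers unfamiliar with these point-based metrics, the sketch below illustrates a point-to-point geometry error and its PSNR form, roughly in the spirit of [35]. It is a minimal illustration of ours: the symmetric max-of-means convention and the free peak parameter are assumptions, not the exact MPEG reference implementation.

```python
# Minimal sketch of a point-to-point geometry distortion metric between a
# reference and a degraded point cloud, and its PSNR form.
import numpy as np
from scipy.spatial import cKDTree

def point_to_point_mse(reference: np.ndarray, degraded: np.ndarray) -> float:
    """Symmetric point-to-point MSE between two (N, 3) arrays of XYZ coordinates."""
    d_ref_to_deg, _ = cKDTree(degraded).query(reference)   # nearest-neighbour distances
    d_deg_to_ref, _ = cKDTree(reference).query(degraded)
    return max(np.mean(d_ref_to_deg ** 2), np.mean(d_deg_to_ref ** 2))

def geometry_psnr(reference: np.ndarray, degraded: np.ndarray, peak: float) -> float:
    """Geometry PSNR; 'peak' is typically tied to the voxel grid resolution of the content."""
    mse = point_to_point_mse(reference, degraded)
    return 10.0 * np.log10((peak ** 2) / mse)
```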
2.2 Point cloud compression

A single point cloud frame is represented by an unordered collection of points sampled from the surface of an object. In a dynamic sequence of point clouds, no point correspondences are maintained across frames. Thus, detecting spatial and temporal redundancies is often difficult, making point cloud compression challenging. Octrees have been used extensively as a space-partitioning structure to represent point cloud geometry; they are the 3D extension of the 2D quadtrees used to encode video and images.

Research into point cloud compression can be broadly divided into two categories. The first is based on signal processing: Zhang et al. [41] proposed a method to compress point cloud attributes using a graph Fourier transform, assuming that an octree has already been created and separately coded for geometry prior to coding the attributes. De Queiroz and Chou [27] used a region-adaptive hierarchical transform in which the colors of nodes at lower levels of the octree predict the colors of nodes at the next level. As these approaches require expensive computations of graph Laplacians, they are not suitable for dynamic sequences in real-time applications. The second category of point cloud codecs is based on extending legacy solutions from image and video compression. Intra-frame coding of octrees can be achieved by entropy coding the occupancy codes, as shown in [23]; the authors then compress the color attributes by mapping them to a 2D grid and using legacy JPEG image compression. In 2017, MPEG started a standardization activity to determine a new standard codec for point clouds, to be launched in 2020. The codec created by Mekuria et al. [23] was used as an anchor to evaluate proposals. To encode dynamic point cloud sequences, MPEG provides two verification models [32]: Geometry-PCC (G-PCC) for point clouds with a sparse distribution, and Video-PCC (V-PCC) for dense point clouds. V-PCC is based on leveraging existing 2D video codecs to compress point cloud geometry and attributes.

3 METHODOLOGY

3.1 Dataset Preparation

A dataset of dynamic point cloud sequences was assembled from the MPEG repository. All sequences were clipped to five seconds and sampled at 30 frames per second. This included point cloud sequences [13, 14] captured using photogrammetry (Longdress, Loot, Red and black, and Soldier are shown in Figure 3) and one sequence of a synthetic character sampled from an animated mesh (Queen). Four additional point cloud sequences, Manfred, Despoina, Sarge (shown in Figure 3) and Rachel, were added for the evaluation. These sequences were created using motion-captured animated mesh sequences. Keyframes were selected at 30 frames per second and extracted along with the associated mesh materials. Particular care was taken to ensure that the selected sequences have the characters facing the user and speaking in their general direction. Then, 1 million points were randomly sampled, independently per keyframe, to create a consistent ground-truth dataset. The points are sampled from the mesh surface with a probability proportional to the area of the underlying mesh face. This was done to ensure there are no direct point correspondences across point cloud frames, to mimic realistic acquisition and to maintain consistency with the rest of the dataset. The point clouds sampled from meshes were used in test T1 and the point clouds captured using photogrammetry were used in test T2.
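As an illustration of the sampling step just described, the sketch below draws points from a triangle mesh surface with per-face probability proportional to face area, using uniform barycentric coordinates. It is a hypothetical reconstruction of the procedure, not the authors' actual tooling, and color/attribute transfer is omitted.

```python
# Illustrative sketch: sample N points uniformly from a triangle mesh surface,
# with per-face selection probability proportional to face area.
import numpy as np

def sample_mesh_surface(vertices, faces, n_points=1_000_000, rng=None):
    """vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices."""
    rng = np.random.default_rng(rng)
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))

    # Face areas via the cross product; pick faces with area-weighted probability.
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    face_idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())

    # Uniform barycentric coordinates inside each chosen triangle.
    u, v = rng.random(n_points), rng.random(n_points)
    flip = u + v > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    return v0[face_idx] + u[:, None] * (v1[face_idx] - v0[face_idx]) \
                        + v[:, None] * (v2[face_idx] - v0[face_idx])
```

Because each keyframe is sampled independently with a fresh random draw, no two frames share point correspondences, mimicking the behavior of a real capture pipeline as described above.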
The X, Y, Z coordinates of each point are represented using unsigned integers, as required by the current version of the V-PCC software. Colors are encoded with 8 bits per channel in the RGB color space. To encode the contents, we first use release 7.0 of the MPEG V-PCC codec. For test T1, the configuration files provided by MPEG for the Queen sequence are used for all the contents. We select the rate points 1, 3 and 5 from the provided preset V-PCC configurations and extend them with an additional final rate point using a texture quantization parameter (QP) of 8, a geometry QP of 12 and an occupancy precision of 2. We re-label the rate points as R1, R2, R3 and R4, respectively. All sequences are encoded using the C2AI (Category 2 All Intra) configuration. For the photogrammetry sequences, we use the predefined dedicated configuration files for each sequence, at the same rate points. The V-PCC compressed bitstream was used to set the bit-rate targets for R1 to R4, separately for each sequence.

We then use the MPEG anchor codec [23] in an all-intra configuration, and match the bit-rates per sequence and rate point (R1-R4) with a tolerance of 10%, as defined in the MPEG call for proposals. The codec was selected as it has a significantly lower encode and decode time and is suitable for real-time applications, as demonstrated by the authors. We use an octree depth from 7 to 10 for the rate points R1 to R4, respectively. The highest possible JPEG quantization parameter values were then chosen per sequence, while meeting the target bit-rate set using V-PCC.

3.2 Experiment setup

All point cloud sequences were rendered using the Unity game engine, by storing all the points of each frame in a vertex buffer and then drawing procedural geometry on the GPU. The point clouds were rendered by drawing a quadrilateral at each point location, with a fixed offset of 0.08 units (corresponding to a side length of approximately 2 mm) around each point (placed at the centre), kept identical for all sequences for consistency. In the case of bit-rate R1 generated using the MPEG anchor, we increased the offset value to 0.16 by eye, as the resulting point clouds were too sparse (shown in Figure 2b). We maintain a fixed frame rate of 30 fps throughout the experiment.

Participants were asked to wear an Oculus Rift head-mounted display to view each of the point cloud sequences. For the 3DoF condition, participants were asked to sit on a swivel chair placed at a fixed location in the room and navigate using head movements alone. For the 6DoF condition, participants were allowed to navigate freely within the room, as shown in Figure 1. Each sequence was 5 seconds long, after which the playback looped. We set the background of the virtual room to mid-grey, to avoid distractions. The Oculus Guardian System was used to display in-application wall and floor markers if the participants got too close to the boundary. We used a workstation with two GeForce GTX 1080 Ti GPUs in SLI and an Intel Core i9 Skylake-X 2.9 GHz CPU.

3.3 Subjective methodology

To perform the experiments, the subjective methodology Absolute Category Rating with Hidden References (ACR-HR) was selected, according to ITU-T Recommendation P.910 [15]. Participants were asked to observe the video sequences depicting digital humans, and rate the corresponding visual quality on a scale from 1 to 5 (1-Bad, 2-Poor, 3-Fair, 4-Good, and 5-Excellent).
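Returning to the codec configuration of Section 3.1, the anchor bit-rate matching can be viewed as a simple search: for each sequence and rate point, fix the octree depth and pick the highest JPEG quality setting whose achieved bit-rate falls within the 10% tolerance band around the V-PCC target. The sketch below is our own reading of that procedure, with encode_anchor() as a hypothetical helper; it is not the authors' scripts.

```python
# Illustrative sketch of matching the MPEG anchor bit-rate to a V-PCC target
# within a 10% tolerance, as described in Section 3.1. encode_anchor() is a
# hypothetical helper that runs the anchor codec and returns the achieved
# bit-rate in Mbps for a given octree depth and JPEG quality setting.

TOLERANCE = 0.10
OCTREE_DEPTH = {"R1": 7, "R2": 8, "R3": 9, "R4": 10}

def match_anchor_rate(sequence, rate_point, target_mbps, encode_anchor):
    depth = OCTREE_DEPTH[rate_point]
    # Search JPEG quality from high to low; keep the highest quality whose
    # achieved bit-rate still falls inside the tolerance band.
    for jpeg_quality in range(100, 0, -1):
        achieved = encode_anchor(sequence, depth, jpeg_quality)
        if abs(achieved - target_mbps) <= TOLERANCE * target_mbps:
            return jpeg_quality, achieved
    return None  # no quality setting meets the target for this depth
```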
A series of pilot studies were conducted to determine the positioning of the digital humans in the virtual space and the length of each sequence, to ensure the sequences ran smoothly within the limited computer RAM. Due to the huge size of the test material, it was not possible to evaluate all 8 point cloud contents in one single session, as long loading times would have fatigued the participants and corrupted the results. Thus, we decided to split the evaluation into two separate tests: one focused on the evaluation of contents obtained from random sampling of meshes (T1: contents Queen, Manfred, Despoina and Sarge), and one focused on contents acquired through photogrammetry (T2: contents Long dress, Soldier, Red and black, and Loot). From each sequence, a subset of frames comprising 5 seconds was selected.

Before the test took place, 3 training sequences depicting examples of 1-Bad, 5-Excellent and 3-Fair were shown to the users to help them familiarize themselves with the viewing condition and test setup, and to guide their rating. The training sequences were created using one additional content not shown during the test, to prevent biased results. For test T1, content Ana was selected, whereas for test T2, content Ulli Wagner was chosen. Each training content was encoded using the point cloud compression algorithms under test. For each test and viewing condition, 36 stimuli were evaluated. For each stimulus, the 5-second sequence was played at least once in full, and kept looping until the participants gave their score. The order of the displayed stimuli was randomized per participant and per viewing condition, and the same content was never displayed twice in a row, to avoid bias. Moreover, the presentation order of the viewing conditions was randomized between participants, to prevent any confounding effect. Two dummy samples were added at the beginning of each viewing session to ease participants into the task, and the corresponding scores were subsequently discarded.

After each viewing condition, participants were requested to fill in the Igroup Presence Questionnaire (IPQ) [31] on a 1-7 discrete scale (1 = fully disagree to 7 = totally agree) and the Simulator Sickness Questionnaire (SSQ) on a 1-4 discrete scale (1 = none to 4 = severe) [18]. The IPQ has three subscales, namely Spatial Presence (SP), Involvement (INV) and Experienced Realism (REAL), and one additional general item (G) not belonging to a subscale, which assesses the general "sense of being there" and has high loadings on all three factors, with an especially strong loading on SP [31]. The SSQ was developed to measure cybersickness in computer simulations and was derived from a measure of motion sickness [18]. For both T1 and T2, after the two viewing conditions, participants were interviewed to 1) compare their experiences of assessing quality in 3DoF and 6DoF, and 2) reflect on the factors they considered when assessing the quality.

A total of 27 participants were recruited for T1 (12 males, 15 females, average age: 22.48 years old), whereas 25 participants were recruited for T2 (17 males, 8 females, average age: 28.39 years old). All participants were screened for color vision and visual acuity, using Ishihara and Snellen charts, respectively, according to ITU-T Recommendation P.910 [15].
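The ordering constraints described above (randomized per participant and viewing condition, the same content never shown twice in a row, two dummy stimuli prepended) can be satisfied with a simple rejection-sampling shuffle. The sketch below is a hypothetical illustration of that logic, not the authors' experiment code:

```python
# Illustrative sketch of the stimulus ordering used in each viewing session:
# fully randomized per participant, with the constraint that the same content
# never appears twice in a row, and two dummy stimuli prepended (their scores
# are later discarded).
import random

def make_presentation_order(stimuli, dummies, max_tries=10_000):
    """stimuli: list of (content, codec, rate) tuples; dummies: list of stimuli."""
    for _ in range(max_tries):
        order = random.sample(stimuli, len(stimuli))          # random permutation
        contents = [s[0] for s in order]
        if all(a != b for a, b in zip(contents, contents[1:])):
            return list(dummies) + order
    raise RuntimeError("could not satisfy the no-repeat constraint")

# Usage note: each test and viewing condition comprises 36 stimuli
# (4 contents x 2 codecs x 4 rate points, plus the 4 hidden references).
```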
Figure 3: Sequences used for the test, from left to right: Manfred, Sarge, Despoina, Queen, Longdress, Loot, Red and black, Soldier

3.4 Data analysis

Outlier detection was performed separately for each test, T1 and T2, according to ITU-T Recommendation P.913 [16]. The recommended threshold values r1 = 0.75 and r2 = 0.8 were used. One outlier was found in test T1, and the corresponding scores were discarded. No outliers were found in the scores collected for test T2. After outlier detection, the Mean Opinion Score (MOS) was computed for each stimulus, independently per viewing condition. The associated 95% Confidence Intervals (CIs) were obtained assuming a Student's t-distribution. Additionally, the Differential MOS (DMOS) was obtained by applying HR removal, following the procedure described in ITU-T Recommendation P.913 [16]. Non-parametric statistical analysis was applied to understand whether statistical differences could be found among variables, using the MATLAB Statistics and Machine Learning Toolbox, along with the ARTool package in R [17].

4 RESULTS

4.1 Subjective quality assessment

Figures 4 and 5 show the results of the subjective quality assessment of the contents comprising test T1 and test T2, respectively, for both the 3DoF and 6DoF viewing conditions. In particular, the MOS scores associated with the compressed contents are shown with solid lines, along with the corresponding CIs, whereas the dashed lines represent the respective DMOS scores. The HR scores for each content are represented with a solid line to indicate the mean, and a shaded plot for the corresponding CIs.

To assess whether significant differences could be found between the two viewing conditions under test, we ran a Wilcoxon signed-rank test on the scores obtained in the two DoF scenarios. The Wilcoxon test was chosen as the gathered data was not found to be normally distributed, according to the Shapiro-Wilk normality test (W = 0.90, p < .001 and W = 0.91, p < .001 for tests T1 and T2, respectively). Results of the Wilcoxon signed-rank test showed statistical significance for DoF for test T1 (Z = 2.97, p = 0.0029, r = 0.07), whereas for test T2, no significance was found (Z = −1.96, p = 0.0502, r = 0.05). Values seem to indicate an effect of the DoF in test T1; however, the small r-value indicates that, while the effect apparently exists, it is small.

It can be observed that codec V-PCC generally has a more favorable performance with respect to the MPEG anchor. This is especially evident for the contents acquired through photogrammetry (see Fig. 5), for which the gap between the two codecs is more pronounced. A Wilcoxon signed-rank test confirmed statistical significance for the two codecs (T1: Z = 9.87, p < .001, T2: Z = 20.18, p < .001), albeit with different effect sizes between test T1 and T2 (r = 0.24 and r = 0.50, respectively).

A Friedman rank test performed on the scores revealed a significant effect of the content on the final scores, for both sets of contents (T1: χ2 = 57.38, p < .001, T2: χ2 = 17.31, p < .001). Table 1 shows the results of the post-hoc test conducted using the Wilcoxon signed-rank test with Bonferroni correction (α = .05/6).
Table 1: Pairwise post-hoc test on the contents for tests T1 and T2, using the Wilcoxon signed-rank test with Bonferroni correction.

         Comparison                         Z        p       r
    T1   Manfred - Sarge                  3.78    <.001    0.12
         Manfred - Despoina               2.09    0.036    0.07
         Manfred - Queen                  7.48    <.001    0.25
         Sarge - Despoina                 1.30    0.192    0.04
         Sarge - Queen                    9.94    <.001    0.33
         Despoina - Queen                 8.79    <.001    0.29
    T2   Long dress - Loot                7.03    <.001    0.23
         Long dress - Red and black       1.08    0.279    0.05
         Long dress - Soldier             4.11    <.001    0.14
         Loot - Red and black             6.42    <.001    0.21
         Loot - Soldier                   3.32    <.001    0.11
         Red and black - Soldier          3.10    0.002    0.10

Table 2: Pairwise post-hoc test on the bit-rates for tests T1 and T2, using the Wilcoxon signed-rank test with Bonferroni correction.

         Comparison        Z        p       r
    T1   R1 - R2       -14.21    <.001    0.50
         R1 - R3       -16.85    <.001    0.60
         R1 - R4       -17.08    <.001    0.60
         R2 - R3       -12.61    <.001    0.45
         R2 - R4       -14.45    <.001    0.51
         R3 - R4        -8.75    <.001    0.30
    T2   R1 - R2       -14.20    <.001    0.50
         R1 - R3       -16.85    <.001    0.60
         R1 - R4       -17.08    <.001    0.60
         R2 - R3       -12.61    <.001    0.45
         R2 - R4       -14.45    <.001    0.51
         R3 - R4        -8.57    <.001    0.30

Figure 4: MOS (solid line) and DMOS (dashed line) against achieved bit-rate, expressed in Mbps, for contents Manfred, Sarge, Despoina and Queen. HR scores are shown using a shaded yellow plot. Each column represents a content in test T1, whereas the first and second rows depict results obtained using the 3DoF and 6DoF viewing conditions, respectively.

Figure 5: MOS (solid line) and DMOS (dashed line) against achieved bit-rate, expressed in Mbps, for contents Long dress, Loot, Red and black and Soldier. HR scores are shown using a shaded yellow plot. Each column represents a content in test T2, whereas the first and second rows depict results obtained using the 3DoF and 6DoF viewing conditions, respectively.

Contents Manfred, Sarge and Despoina all show statistical significance with respect to content Queen (p < .001, r > 0.20 for all pairs). Statistical significance has also been observed between contents Manfred and Sarge, albeit with a smaller effect size (p < .001, r = 0.12). For contents acquired through photogrammetry, statistical significance was found between contents Long dress and Loot, and Loot and Red and black (p < .001, r > 0.20 in both cases), as well as between contents Long dress and Soldier, Loot and Soldier (p < .001, r > 0.10), and Red and black and Soldier (p = 0.0019, r = 0.10). Results corroborate our previous statements on how contents Long dress and Red and black appeared to be given different scores with respect to contents Loot and Soldier.

We also ran a Friedman rank test on the scores to assess whether the selected bit-rates showed statistical significance. Results confirmed that the bit-rates have a significant effect for both tests (T1: χ2 = 682.29, p < .001, T2: χ2 = 667.39, p < .001). Post-hoc analysis using the Wilcoxon signed-rank test with Bonferroni correction (α = .05/6), shown in Table 2, further confirmed that all pairwise comparisons were statistically significant, for both test T1 and T2 (p < .001, r > 0.30 for all pairs).

In order to further analyze the effect of DoF conditions, contents, codecs and bit-rates, and their interactions, on the gathered scores, we fitted a full linear mixed-effects model on the data, accounting for the randomness introduced by the participants. Due to the non-normality of our data, the aligned rank transform was applied prior to the fitting [39].
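For readers who wish to reproduce this style of analysis with open tooling, the non-parametric tests reported above map onto standard SciPy calls. The sketch below uses random placeholder ratings rather than the study's data, and stands in for the MATLAB/R pipeline actually used by the authors:

```python
# Illustrative sketch of the non-parametric tests reported in Section 4.1,
# using SciPy instead of the MATLAB/R tooling used by the authors. The score
# array here is random placeholder data, not the study's ratings.
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

rng = np.random.default_rng(0)
# scores[participant, codec, rate]: ACR ratings on the 1-5 scale (placeholder).
scores = rng.integers(1, 6, size=(26, 2, 4)).astype(float)

# Wilcoxon signed-rank test between the two codecs (paired per participant and rate).
vpcc = scores[:, 0, :].ravel()
anchor = scores[:, 1, :].ravel()
stat, p_codec = wilcoxon(vpcc, anchor)

# Friedman rank test across the four bit-rate points (averaged over codecs).
per_rate = [scores[:, :, r].mean(axis=1) for r in range(4)]
chi2, p_rate = friedmanchisquare(*per_rate)

print(f"codec effect: p = {p_codec:.3g}; bit-rate effect: chi2 = {chi2:.1f}, p = {p_rate:.3g}")
```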
Since the transform is designed for a fully randomized test, it is not suitable for the scores collected during the test, as the HR addition makes the design matrix rank deficient. However, the transform can be applied to the differential scores used to obtain the DMOS, as they follow a fully randomized design. Thus, it was decided to perform the analysis on the differential scores.

For test T1, analysis of deviance on the full mixed-effects model showed significance for the main effects Content (F = 48.14, df = 3, p < .001), Codec (F = 51.01, df = 1, p < .001) and bit-rate (F = 375.35, df = 3, p < .001), but not for DoF (F = 0.0003, df = 1, p = 0.988). Moreover, significant interaction effects were found for DoF - Content (F = 4.31, df = 3, p = 0.005), Content - bit-rate (F = 5.88, df = 9, p < .001) and Codec - bit-rate (F = 4.73, df = 3, p = 0.003). Post-hoc interaction analysis with Holm p-value adjustment indicates that the difference between 3DoF and 6DoF is statistically significant at the 5% level when comparing contents Manfred and Queen (χ2 = 10.34, p = 0.008), as well as Inspector and Queen (χ2 = 8.35, p = 0.019). In other words, the relative difference in scores between contents Manfred and Queen (and Inspector and Queen) was not found to be statistically equivalent in 3DoF with respect to 6DoF. This indicates that the DoF might have an effect on how contents are scored with respect to one another, for example by increasing or reducing their differences. Regarding the interaction effect between contents and bit-rates, post-hoc interaction analysis with Holm p-value correction showed statistical significance in differences between contents Manfred and Queen at bit-rates R2 and R4 (χ2 = 29.52, p < .001), between contents Sarge and Despoina at bit-rates R2 and R4 (χ2 = 11.00, p = 0.028), between Sarge and Queen at bit-rates R2 and R4 (χ2 = 11.56, p = 0.022), and between Despoina and Queen at bit-rates R1 and R2 (χ2 = 13.75, p = 0.007), R2-R3 (χ2 = 13.59, p = 0.007) and R2-R4 (χ2 = 45.13, p < .001). These results can be explained considering that the low HR scores given to content Queen meant a narrower range of ratings. Thus, bit-rate point R2, for example, presents relatively higher differential scores for Queen with respect to the rest of the contents, whereas for bit-rate point R4, due to the HR removal, all contents have similar ratings. This is reflected in the statistical analysis conducted on the scores. Finally, post-hoc interaction analysis with Holm p-value adjustment on differences between codecs and bit-rates shows that the difference between codecs is statistically significant at the 5% level only between R1 and R2 (χ2 = 10.51, p = 0.007), R1 and R3 (χ2 = 7.09, p = 0.031), and R1 and R4 (χ2 = 10.17, p = 0.007). This indicates that the differences between codecs remain constant at all bit-rates, except for R1. This is in line with what is observed in Fig. 4, which shows similar trends for codec V-PCC with respect to the MPEG anchor, except for the lowest bit-rate point, for which V-PCC achieves better performance.

Results of the analysis of deviance on the full mixed-effects model for test T2 showed significance for the main effects Content (F = 139.41, df = 3, p < .001), Codec (F = 692.24, df = 1, p < .001) and bit-rate (F = 485.11, df = 3, p < .001), but not for DoF (F = 2.57, df = 1, p = 0.115), similarly to what was seen for test T1.
Interactions were found to be significant at the 5% level between Content and Codec (F = 3.81, df = 3, p = 0.01), Content and bit-rate (F = 3.03, df = 9, p = 0.001), and Codec and bit-rate (F = 39.40, df = 3, p < .001). The lack of significance in interactions involving DoF is in line with the results of the Wilcoxon signed-rank test, which showed no significance for DoF in test T2 (Z = −1.96, p = 0.0502, r = 0.05). Post-hoc interaction analysis with Holm p-value adjustment shows significance at the 5% level for the differences between codecs for content Long dress with respect to content Loot (χ2 = 10.09, p = 0.009). This confirms what can be seen in Fig. 5: the gap between codecs is more prominent for content Loot with respect to Long dress, probably due to the reduced range associated with a low-rated HR. Post-hoc analysis on the interaction between contents and bit-rates indicates statistical significance at the 5% level for differences between contents Long dress and Soldier when considering differences between bit-rates R1-R4 (χ2 = 17.03, p = 0.001) and R2-R4 (χ2 = 11.81, p = 0.021), and between contents Red and black and Soldier for differences between R1 and R4 (χ2 = 11.80, p = 0.021). Again, this can be explained considering that both Long dress and Red and black received remarkably lower scores, which resulted in a narrower rating range. Thus, the differences between the lowest and highest bit-rates are quite different between those two contents and Soldier, which benefited from a larger rating span. Lastly, post-hoc analysis on the interaction between codecs and bit-rates reveals statistical significance at the 5% level for all pairwise comparisons, except R1-R3 (R1-R2: χ2 = 14.60, p < .001, R1-R4: χ2 = 46.58, p < .001, R2-R3: χ2 = 13.81, p < .001, R2-R4: χ2 = 113.34, p < .001, R3-R4: χ2 = 48.02, p < .001). Indeed, in Fig. 5 it is quite evident that the curves for the two codecs follow different trends. In particular, codec V-PCC seems to saturate between R2 and R3, whereas a steeper slope is observed for the MPEG anchor.

Figure 6: Average time spent looking at the sequence (in seconds) and corresponding CIs, against the score given to the sequence, for 3DoF (blue) and 6DoF (red), in test T1 (left) and T2 (right).

4.2 Additional questionnaires and interaction data

4.2.1 IPQ & SSQ Questionnaires

For T1 and T2, the collected IPQ data under each subscale are all normally distributed, as assessed by the Shapiro-Wilk test (p > 0.05). A paired-sample t-test was applied to check the differences between 3DoF and 6DoF in terms of SP, INV, REAL and G. For T1, there was a significant difference in SP between 3DoF (M=4.13, SD=0.92) and 6DoF (M=5.04, SD=0.67), t(26)=-4.44, p < .001, Cohen's d = 0.52, and also a significant difference in G between 3DoF (M=4.11, SD=1.28) and 6DoF (M=4.96, SD=1.13), t(26)=-2.60, p < .01, Cohen's d = 0.64. For T2, SP was also significantly different between 3DoF (M=4.16, SD=1.17) and 6DoF (M=4.83, SD=1.12), t(24)=-3.48, p < .01, Cohen's d = 0.45, and so was G between 3DoF (M=4.20, SD=1.61) and 6DoF (M=5.08, SD=1.19), t(24)=-3.56, p < .01, Cohen's d = 0.71. The other factors showed no significant differences between 3DoF and 6DoF in either T1 or T2. With respect to the SSQ, no significant differences (p > 0.05) were found between 3DoF and 6DoF in terms of cybersickness.
We further tested whether there were order effects in experiencing cybersickness, as half of the participants started with 6DoF as the first condition and 3DoF as the second, and the remainder the reverse. No significant order effects (p > 0.05) were found in experiencing cybersickness.

4.2.2 Interaction time

Interaction time was found to be strongly correlated with MOS values in a study conducted on light field image quality assessment [38]. In particular, it was found that users tended to spend more time interacting with contents at high quality, whereas for low quality scores, less time was spent looking at the contents. In order to see whether similar trends could be observed in our data, we compared the average time spent watching the sequence in 3DoF and 6DoF, separately for each quality score given by the participants. Results are shown in Fig. 6. A positive trend can be observed between the given score and the average time spent looking at the sequence, with the exception of score 5, which for test T2 shows a negative trend with respect to time. However, it should be considered that, on average, a small percentage of scores equal to 5 was given in test T2 (10% of the total scores); thus, variations may be due to the difference in sample size. It is also worth noting that, on average, participants spent more time looking at the sequences in 6DoF than in 3DoF. Indeed, several participants pointed out that the lowest scores were the fastest to give, whereas for higher quality, it was harder to decide on the rating.

4.2.3 Interviews

We asked the same interview questions in T1 and T2, so we combined the interview transcripts of all 52 participants (T1=27, T2=25). The categorized answers are presented as follows:

Factors considered when assessing quality. 56% of the participants mentioned that they assessed the quality based on three criteria: 1) overall outline and pattern distortion on the body and clothes, 2) natural gestures and movements of the digital humans, and 3) visual artifacts such as blockiness, blurriness, and extraneous floating artifacts. 48% of the participants mentioned that their quality assessment criteria were content dependent, agreeing that it is easier to spot artifacts on contents with complex patterns (e.g., Long dress) and dominant colors (e.g., Red and black) than on contents with uniform colors (e.g., Soldier and Sarge). 46% of the participants considered facial expressions a factor that cannot be ignored in quality assessment, which they believe is an important cue for social connectedness. Regarding the extraneous floating artifacts (e.g., bubbles flickering around the digital humans), 23% found them very annoying and said they lowered the overall quality of the content, but a few participants (8%) stated that these artifacts did not influence their quality judgement.

Difficulties in assessment. 42% of the participants pointed out difficulties in assessing the quality, especially for the high quality contents, which are not perfect and still show imperfections such as blurry faces or distorted fingers. 15% of the participants specifically pointed out that it is difficult to distinguish between quality levels 3 to 5. 17% of the participants commented that rating the quality gradually became easier as they adapted to the contents, so the second viewing condition was easier for them.

Comparison between 3DoF and 6DoF.
52% of the participants preferred 6DoF, because it allowed them to move closer to examine details (e.g., shoes and fingers), and walking in the virtual space felt more realistic to them. However, they also commented that 3DoF offered a fixed distance between them and the digital humans, enabling a more stable and focused assessment. 21% of the participants preferred the relaxed, passive viewing of 3DoF: they did not find much difference between 3DoF and 6DoF in terms of quality assessment, but they found 3DoF less nauseating than 6DoF.

4.3 Analysis of results

Results vary considerably depending on the content under assessment. In particular, for test T1, content Queen is generally given lower ratings with respect to the other contents in the test. This is made evident by the MOS score given to the HR, which is equal to 3.35 for both the 3DoF and 6DoF conditions, indicating that even when uncompressed, the content was never considered as having good quality. As a result, the MOS scores computed for the content have a limited range, spanning between 1.08 and 3.35 for the 6DoF case, and between 1 and 2.58 for the 3DoF case (excluding the HR). Such a narrow range is inadequate for expressing the quality variations among different compression parameters: for the 3DoF case in particular, a paired t-test at 5% significance shows that bit-rate points R3 and R4 are statistically equivalent for both codecs, and for codec V-PCC, R2 is considered statistically equivalent to R4, despite the latter being 10 times as large. Statistical analysis confirmed that content Queen showed different rating patterns with respect to the other contents. The ratings given to the rest of the contents comprising T1 have a larger range, seemingly covering the entire rating space. Trends show that codec V-PCC is generally preferred to the MPEG anchor, especially at low bit-rates, whereas for the highest bit-rate point R4, the codecs are always statistically equivalent at the 5% level, in both 3DoF and 6DoF. It is worth noting that the two codecs seldom reach the same quality level as the uncompressed HRs. In particular, for the 3DoF viewing condition, transparent quality (that is, the level of quality for which the distortions are "transparent" to the user, meaning that statistical equivalence with the HR has been observed) is only achieved by content Manfred at bit-rate R4, by both codecs. On the other hand, in the 6DoF scenario, V-PCC encoded contents at bit-rate R4 seem to always be statistically equivalent to the HR.

Rating variability among different contents is even more visible for contents acquired through photogrammetry. In particular, at high bit-rates, contents Long dress and Red and black are consistently given lower scores with respect to contents Loot and Soldier, for both DoF conditions. In fact, as seen with content Queen above, neither content reaches MOS levels higher than 4, even when considering the HR. This indicates that the source content is never considered of excellent quality, even when no compression artifact is involved. This impacts the way scores are distributed across the rating space: left with a smaller rating range (as the higher rating values are never given), MOS results show that contents compressed at high bit-rates are considered statistically equivalent to the respective HR. This is particularly evident when considering the DMOS scores, which have an operational range between 4 and 5 for codec V-PCC, for a range of bit-rates spanning between 4 and 120 Mbps.
On the other hand, for contents that were given higher scores (Loot and Soldier) and use the full rating space, results indicate what was already seen in test T1: no codec is able to reach transparent quality, meaning that the scores given to the compressed contents are always statistically different with respect to the HR.

Decisions on which codec to employ should be made depending on the use case. The MPEG anchor is more suitable for real-time systems, due to its fast encoding time, and at high enough bit-rates, differences with the other codec become less noticeable. V-PCC, on the other hand, might be more appropriate for on-demand streaming and storage, since it retains better quality for the same bit-rate. For the majority of the contents under test, a bit-rate between 20 and 40 Mbps seems to provide an acceptable quality. However, the selection of the appropriate target bit-rate should also take into account other factors, such as network conditions, available bandwidth and scene complexity.

Statistical analysis showed a small effect of the chosen DoF condition on the gathered scores for test T1. In general, the two visualization scenarios led to similar trends in MOS values; however, several participants pointed out that, while 3DoF offered a more stable assessment, as the same point of view is used for all contents, 6DoF felt more realistic. Any decision between the two viewing conditions for quality assessment, thus, should be made considering the trade-off between an immersive, personalized experience and fairness of comparison between solutions.

5 DISCUSSION

5.1 Datasets

Despite the rich literature on point cloud acquisition and compression, few point cloud datasets are publicly available. This is especially true when considering point cloud datasets depicting photo-realistic humans. One of the most popular and widely used full-body datasets, created by 8i Labs [13], consists of only 4 individual contents, whereas the HHI Fraunhofer dataset has 1 individual content [14]. In the context of point cloud compression, such scarcity of available data may lead to compression solutions being designed, optimized and tested on a considerably narrow range of input data, thus leading to algorithms that are overfitted to the specifics of the acquisition method used to obtain the contents. The consequences of such a scenario are reflected in our results. Whereas a large difference was observed between codec V-PCC and the MPEG anchor for the contents assessed in test T2, for the contents in test T1 the gap was markedly lower, and indeed the effect of the codec selection had a smaller effect size for test T1 than for test T2, as seen in Section 4.1. Test T2 consisted of contents that had been used in multiple quality assessment experiments [8, 9, 11, 36], notably including the performance evaluation of the upcoming MPEG standard [32]. On the other hand, test T1 included contents that have not been used so far in the assessment of point cloud compression solutions. The discrepancies in the results of the subjective quality assessment campaign indicate that performance gains may vary considerably when new contents are evaluated. A larger body of contents depicting digital humans, involving several acquisition technologies, is needed in order to properly design, train and evaluate new compression solutions in a robust way.
5.2 Personal preferences and bias

Subjective evaluation experiments are complicated by many aspects of human psychology and viewing conditions, such as participants' vision ability, the translation of quality perception into rating scores, adaptation, and personal preferences for contents. By carefully following ITU-T Recommendation P.913 [16], we are able to control some of these aspects: for example, eliminating the scores given by participants with vision problems, training participants to help them understand the quality levels, and randomizing the stimuli and viewing conditions to minimize order effects. However, we noticed that personal preferences towards certain contents are difficult to control. Satgunam et al. [29] found that their participants were divided into two preference groups: those preferring sharper content versus those preferring smoother content. Similarly, Kortum and Sullivan [19] found that the "desirability" of the content had an impact on video quality responses, with a more desirable video clip being given a higher rating. In our experiments, content Queen is generally given lower ratings with respect to the other contents. In the interviews, many participants (27%) expressed dislike towards Queen, because of her lifeless look and static gestures; 40% showed a preference towards Soldier, due to his high-resolution facial features, uniformly toned clothes and natural movements. This observation suggests that quality assessment may need to be adjusted based on content and viewer preferences, and that training should be offered with a variety of contents.

5.3 Technological constraints and limitations

The two codecs used in this experiment introduce different distortions during compression. As the MPEG anchor codec uses the octree data structure to represent geometry, the number of points in the decoded cloud varies exponentially with the tree depth. Thus, at lower bit-rates, the decoded point clouds are quite sparse, and when the point size is increased to make them appear watertight, they have a blocky appearance. This codec design allows for future optimizations based on human perception of 3D objects in VR. The low-delay encoding and decoding of this codec makes it suitable for real-time applications such as social VR. On the other hand, the V-PCC codec leverages existing 2D video codecs to compress both geometry and color, which introduces noise in the form of extraneous objects, and general geometric artifacts such as misaligned seams. However, the approach yields better results at low bit-rates, as demonstrated by our results. The codec is optimized for human perception of 2D video, and this might not transfer to the perception of 3D objects in VR. The mapping from 3D to 2D is critical to codec performance, thus the encoding phase has high complexity. Decoding has a lower delay, as it benefits from hardware acceleration of video decoders on GPUs, making this approach suitable for on-demand streaming.

One of the main shortcomings of both compression solutions lies in their inability to reach visually lossless quality, as demonstrated by our results. Achieving a visually pleasant result is of paramount importance for the market adoption of the technology; indeed, poor visual quality might lead consumers to tune out of the experience altogether [1]. Visual perception should be taken into account when designing compression solutions, especially at high bit-rates, to ensure that, in the absence of strict bandwidth constraints, excellent quality can be achieved.
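To make the octree observation above concrete, the sketch below (a simplified illustration of ours, not the anchor codec's implementation) voxelizes a surface-like point cloud at the octree depths used for rate points R1-R4. For a surface, each additional level roughly quadruples the number of occupied voxels until it saturates at the input point count, which is why shallow depths yield sparse, blocky reconstructions.

```python
# Simplified illustration of octree-depth quantization (not the anchor codec):
# at depth d the bounding cube is split into 2^d voxels per axis, and all points
# falling into the same voxel collapse into a single decoded point.
import numpy as np

def voxelize(points: np.ndarray, depth: int) -> np.ndarray:
    """points: (N, 3) array; returns unique voxel centres at the given depth."""
    mins = points.min(axis=0)
    extent = float((points.max(axis=0) - mins).max()) or 1.0
    grid = 2 ** depth
    idx = np.clip(((points - mins) / extent * grid).astype(int), 0, grid - 1)
    voxels = np.unique(idx, axis=0)
    return mins + (voxels + 0.5) / grid * extent  # voxel centres

# Surface-like test cloud: 1M points on a unit sphere (real captures are surfaces).
rng = np.random.default_rng(1)
v = rng.normal(size=(1_000_000, 3))
cloud = v / np.linalg.norm(v, axis=1, keepdims=True)

for depth in (7, 8, 9, 10):                      # octree depths used for R1-R4
    print(depth, len(voxelize(cloud, depth)))    # occupied-voxel count grows with depth
```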
5.4 Protocols for subjective assessment in VR

Choosing the right methodology for collecting users' opinions is a delicate matter, as it can influence the statistical power of the collected scores, and in some cases lead to differences in results. Single stimulus methodologies, in particular, lead to larger CIs with respect to double stimulus methodologies, and are more susceptible to being influenced by individual content preferences [16]. An early study comparing single and double stimulus methodologies for the evaluation of colorless point cloud contents indicated that the latter was more consistent in recognizing the level of impairment, as relative differences facilitate the rating task [4]. However, the study pointed out that the single stimulus methodology shows more discrimination power for compression-like artifacts, albeit at the cost of wider CIs. Double stimulus methodologies, while commonly used in video quality assessment and widely adopted in 2D-based quality assessment of point cloud contents [8, 11, 32], are tricky to adopt in VR, due to the difficulty of displaying both contents simultaneously in a perceptually satisfying way [26], while ensuring a fair comparison between the contents under evaluation. When dealing with interactive methodologies, in particular, synchronous display of any viewport modification is usually enforced, to ensure that the two contents are always seen under the same conditions [8, 38]. This is clearly challenging to implement in a 6DoF scenario, in which users are free to change their position in the VR space at any given time. Positioning the two contents side by side in the same virtual space would mean that, at any given time, they are seen from two different angles; the same problem would arise if temporal sequencing were employed. A toggle-based method like the one proposed in [26] is not applicable to moving sequences, as different frames would be seen between stimuli.

In our study, we saw that content preference had an impact on the ratings, as several contents were deemed of lower quality, as the scores given to the HRs exemplify. Such bias resulted in a reduced rating range for those contents. Results of the interviews also pointed out that the naturalness of gestures was an important criterion in assessing the visual quality. Such components would normally not be evaluated in a double stimulus scenario; however, they are important in understanding how human perception reacts to digital humans.

6 CONCLUSION

We compare the performance of the point cloud compression standard V-PCC against an octree-based anchor codec (MPEG anchor). Participants were invited to assess the quality of digital humans represented as dynamic point clouds, in both 3DoF and 6DoF conditions. The results indicate that codec V-PCC has a more favorable performance than the MPEG anchor, especially at low bit-rates. For the highest bit-rate, the two codecs are often statistically equivalent. Results indicate that the content under test has a significant influence on how the scores are distributed; thus, new data sets are needed in order to comprehensively evaluate compression distortions. Moreover, current encoding solutions, while efficient at low bit-rates, are unable to provide visually lossless results, even when large volumes of data are available, revealing significant shortcomings in point cloud compression.
We also point out that the commonly used double stimulus methodologies for quality evaluation often reduce the rating task to difference recognition, missing insights on the quality of the original contents.

ACKNOWLEDGMENTS

This work is funded by the European Commission H2020 program, under the grant agreement 762111, VRTogether, http://vrtogether.eu/

REFERENCES

[1] OTT: Beyond Entertainment Consumer Survey Report. https://www.conviva.com/research/ott-beyond-entertainment/.
[2] D. S. Alexiadis, D. Zarpalas, and P. Daras. Real-time, full 3-D reconstruction of moving foreground objects from multiple consumer depth cameras. IEEE Transactions on Multimedia, 15(2):339–358, 2012.
[3] E. Alexiou and T. Ebrahimi. On subjective and objective quality evaluation of point cloud geometry. In 2017 Ninth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–3. IEEE, 2017.
[4] E. Alexiou and T. Ebrahimi. On the performance of metrics to predict quality in point cloud representations. In Applications of Digital Image Processing XL, vol. 10396, p. 103961H. International Society for Optics and Photonics, 2017.
[5] E. Alexiou and T. Ebrahimi. Impact of visualisation strategy for subjective quality assessment of point clouds. In 2018 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 1–6. IEEE, 2018.
[6] E. Alexiou and T. Ebrahimi. Point cloud quality assessment metric based on angular similarity. In 2018 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE, 2018.
[7] E. Alexiou, E. Upenik, and T. Ebrahimi. Towards subjective quality assessment of point cloud imaging in augmented reality. In 2017 IEEE 19th International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6. IEEE, 2017.
[8] E. Alexiou, I. Viola, T. M. Borges, T. A. Fonseca, R. L. de Queiroz, and T. Ebrahimi. A comprehensive study of the rate-distortion performance in MPEG point cloud compression. APSIPA Transactions on Signal and Information Processing, 8:27, 2019. doi: 10.1017/ATSIP.2019.20
[9] E. Alexiou, P. Xu, and T. Ebrahimi. Towards modelling of visual saliency in point clouds for immersive applications. In 26th IEEE International Conference on Image Processing (ICIP), 2019.
[10] J. Constine. Facebook animates photo-realistic avatars to mimic VR users' faces, 2018.
[11] L. A. da Silva Cruz, E. Dumić, E. Alexiou, J. Prazeres, R. Duarte, M. Pereira, A. Pinheiro, and T. Ebrahimi. Point cloud quality evaluation: Towards a definition for test conditions. In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6. IEEE, 2019.
[12] R. L. de Queiroz and P. A. Chou. Motion-compensated compression of dynamic voxelized point clouds. IEEE Transactions on Image Processing, 26(8):3886–3895, 2017.
[13] E. d'Eon, B. Harrison, T. Myers, and P. A. Chou. 8i Voxelized Full Bodies - A Voxelized Point Cloud Dataset. ISO/IEC JTC1/SC29 Joint WG11/WG1 (MPEG/JPEG) input document WG11M40059/WG1M74006, Geneva, January 2017.
[14] T. Ebner, I. Feldmann, O. Schreer, P. Kauff, and T. v. Unger. HHI point cloud dataset of a boxing trainer. ISO/IEC JTC1/SC29 Joint WG11/WG1 (MPEG/JPEG) input document MPEG2018/m42921, Ljubljana, July 2018.
[15] ITU-T P.910. Subjective video quality assessment methods for multimedia applications. International Telecommunication Union, April 2008.
[16] ITU-T P.913.
Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment. International Telecommunication Union, March 2016.
[17] M. Kay and J. Wobbrock. mjskay/artool: ARTool 0.10.6, Feb. 2019. doi: 10.5281/zenodo.2556415
[18] R. S. Kennedy, N. E. Lane, K. S. Berbaum, and M. G. Lilienthal. Simulator sickness questionnaire: An enhanced method for quantifying simulator sickness. The International Journal of Aviation Psychology, 3(3):203–220, 1993.
[19] P. Kortum and M. Sullivan. The effect of content desirability on subjective video quality ratings. Human Factors, 52(1):105–118, 2010.
[20] S. Y. Liaw, G. A. C. Carpio, Y. Lau, S. C. Tan, W. S. Lim, and P. S. Goh. Multiuser virtual worlds in healthcare education: A systematic review. Nurse Education Today, 65:136–149, 2018.
[21] J.-L. Lugrin, M. Landeck, and M. E. Latoschik. Avatar embodiment realism and virtual fitness training. In 2015 IEEE Virtual Reality (VR), pp. 225–226. IEEE, 2015.
[22] K. Mammou. PCC test model category 2 v0. ISO/IEC JTC1/SC29/WG11 N17248, 1, 2017.
[23] R. Mekuria, K. Blom, and P. Cesar. Design, implementation, and evaluation of a point cloud codec for tele-immersive video. IEEE Transactions on Circuits and Systems for Video Technology, 27(4):828–842, 2017.
[24] S. Narang, A. Best, A. Feng, S.-h. Kang, D. Manocha, and A. Shapiro. Motion recognition of self and others on realistic 3D avatars. Computer Animation and Virtual Worlds, 28(3-4):e1762, 2017.
[25] S. Orts-Escolano, C. Rhemann, S. Fanello, W. Chang, A. Kowdle, Y. Degtyarev, D. Kim, P. L. Davidson, S. Khamis, M. Dou, et al. Holoportation: Virtual 3D teleportation in real-time. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, pp. 741–754. ACM, 2016.
[26] A.-F. Perrin, C. Bist, R. Cozot, and T. Ebrahimi. Measuring quality of omnidirectional high dynamic range content. In Applications of Digital Image Processing XL, vol. 10396, p. 1039613. International Society for Optics and Photonics, 2017.
[27] R. L. de Queiroz and P. A. Chou. Compression of 3D point clouds using a region-adaptive hierarchical transform. IEEE Transactions on Image Processing, 25, June 2016.
[28] D. Roth, K. Waldow, M. E. Latoschik, A. Fuhrmann, and G. Bente. Socially immersive avatar-based communication. In 2017 IEEE Virtual Reality (VR), pp. 259–260. IEEE, 2017.
[29] P. N. Satgunam, R. L. Woods, P. M. Bronstad, and E. Peli. Factors affecting enhanced video quality preferences. IEEE Transactions on Image Processing, 22(12):5146–5157, 2013.
[30] O. Schreer, I. Feldmann, T. Ebner, S. Renault, C. Weissig, D. Tatzelt, and P. Kauff. Advanced volumetric capture and processing. SMPTE Motion Imaging Journal, 128(5):18–24, 2019.
[31] T. W. Schubert. The sense of presence in virtual environments: A three-component scale measuring spatial presence, involvement, and realness. Zeitschrift für Medienpsychologie, 15(2):69–71, 2003.
[32] S. Schwarz, M. Preda, V. Baroncini, M. Budagavi, P. Cesar, P. A. Chou, R. A. Cohen, M. Krivokuća, S. Lasserre, Z. Li, et al. Emerging MPEG standards for point cloud compression. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 9(1):133–148, 2018.
[33] M. Seymour, K. Riemer, and J. Kay. Actors, avatars and agents: potentials and implications of natural face technology for the creation of realistic visual presence. Journal of the Association for Information Systems, 19(10):953–981, 2018.
[34] M. Slater and M. V. Sanchez-Vives.
Enhancing our lives with immersive virtual reality. Frontiers in Robotics and AI, 3:74, 2016.
[35] D. Tian, H. Ochimizu, C. Feng, R. Cohen, and A. Vetro. Geometric distortion metrics for point cloud compression. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3460–3464. IEEE, 2017.
[36] E. M. Torlig, E. Alexiou, T. A. Fonseca, R. L. de Queiroz, and T. Ebrahimi. A novel methodology for quality assessment of voxelized point clouds. In Applications of Digital Image Processing XLI, vol. 10752, p. 107520I. International Society for Optics and Photonics, 2018.
[37] H. T. T. Tran, N. P. Ngoc, C. T. Pham, Y. J. Jung, and T. C. Thang. A subjective study on user perception aspects in virtual reality. Applied Sciences, 9(16):3384, 2019.
[38] I. Viola and T. Ebrahimi. A new framework for interactive quality assessment with application to light field coding. In Applications of Digital Image Processing XL, vol. 10396, p. 103961F. International Society for Optics and Photonics, 2017.
[39] J. O. Wobbrock, L. Findlater, D. Gergle, and J. J. Higgins. The Aligned Rank Transform for nonparametric factorial analyses using only ANOVA procedures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 143–146. ACM, 2011.
[40] E. Zerman, P. Gao, C. Ozcinar, and A. Smolic. Subjective and objective quality assessment for volumetric video compression. In IS&T International Symposium on Electronic Imaging 2019: Image Quality and System Performance XVI, 2019.
[41] C. Zhang, D. Florencio, and C. Loop. Point cloud attribute compression with graph transform. In 2014 IEEE International Conference on Image Processing (ICIP), October 2014.
[42] J. Zhang, W. Huang, X. Zhu, and J.-N. Hwang. A subjective quality evaluation for 3D point cloud models. In 2014 International Conference on Audio, Language and Image Processing, pp. 827–831. IEEE, 2014.