Split Hierarchical Variational Compression
Tom Ryder, Chen Zhang, Ning Kang, Shifeng Zhang
2022-04-05

Variational autoencoders (VAEs) have witnessed great success in performing the compression of image datasets. This success, made possible by the bits-back coding framework, has produced competitive compression performance across many benchmarks. Despite this, VAE architectures are currently limited by a combination of coding practicalities and compression ratios. That is, not only do state-of-the-art methods, such as normalizing flows, often demonstrate better performance, but the initial bits required in coding make single and parallel image compression challenging. To remedy this, we introduce Split Hierarchical Variational Compression (SHVC). SHVC introduces two novelties. Firstly, we propose an efficient autoregressive prior, the autoregressive sub-pixel convolution, that allows a generalisation between per-pixel autoregressions and fully factorised probability models. Secondly, we define our coding framework, the autoregressive initial bits, which flexibly supports parallel coding and avoids, for the first time, many of the practical difficulties commonly associated with bits-back coding. In our experiments, we demonstrate SHVC is able to achieve state-of-the-art compression performance across full-resolution lossless image compression tasks, with up to 100x fewer model parameters than competing VAE approaches.

The volume of data, measured in terms of IP traffic, is currently witnessing exponential year-on-year growth [9]. Fuelled by the demand for high-resolution media content, it is estimated that 80% of this data is in the form of images and video [9]. Data service providers, such as cloud and streaming platforms, have consequently seen the costs associated with transmission and storage become prohibitively expensive. For example, increased demand for streaming services forced major providers to throttle the maximum resolution of video content to 720p during the coronavirus pandemic. As such, these challenges have renewed the need for the development of high-performance data compression codecs.

* co-first author. The work of Tom Ryder was conducted during his employment at Huawei Technologies R&D UK.

One solution to this problem has been the development of approaches using likelihood-based generative models capable of discrete density estimation [3, 6, 13, 14, 18, 22, 24, 35, 36, 42, 43]. Such methods operate by learning a deep probabilistic model of the data distribution, which, in combination with entropy coders, can be used to compress data. Here, according to Shannon's source coding theorem [21], the minimal required average codelength is bounded by the expected negative log-likelihood of the data distribution. From this family of generative models, three dominant modes for data compression have emerged: normalizing flows [3, 14, 42, 43], variational autoencoders [18, 24, 36] and autoregressive models [15, 31, 37]. In fact, each of these approaches can be thought of as a traversal of the Pareto frontier of inference speed and compression performance. With broad generality, autoregressive models are often the most powerful but the slowest; variational autoencoders are often the weakest but the fastest; and normalizing flows, depending on the variant, sit somewhere in between.
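As a toy numerical illustration of the codelength bound above (not drawn from the paper's experiments): for a discrete source p_data and a model p, the achievable average codelength under entropy coding is the cross-entropy $\mathbb{E}_{p_\text{data}}[-\log_2 p(x)]$ bits per symbol, which is minimised, at the source entropy H(x), only when p matches p_data.

```python
import math

# Toy four-symbol source and an imperfect model of it.
p_data = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
p_model = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

# Entropy of the source: the best achievable average codelength (bits/symbol).
entropy = -sum(p * math.log2(p) for p in p_data.values())

# Cross-entropy: the average codelength when entropy coding with the model.
cross_entropy = -sum(p_data[s] * math.log2(p_model[s]) for s in p_data)

print(f"H(x)                = {entropy:.3f} bits/symbol")        # 1.750
print(f"E_pdata[-log2 p(x)] = {cross_entropy:.3f} bits/symbol")  # 1.801
```

The gap (here about 0.05 bits per symbol) is the KL divergence between p_data and p, i.e. the rate penalty paid for model mismatch; likelihood-based generative models aim to shrink this gap.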
In this paper, we consider data compression with VAEs, and focus on extending the efficient frontier: obtaining solutions faster than popular VAEs that achieve state-of-the-art compression ratios. The use of VAEs, however, poses two outstanding challenges. Firstly, we should achieve competitive coding ratios without greatly sacrificing time complexity. For example, the best current methods require one of two ingredients to improve performance: building either a deep hierarchy of latent variables [8] or using autoregressive priors [11, 29]. The latter idea, especially popular in the codecs of the lossy compression community [25], posits a model that flexibly learns both local data modalities (via autoregression) and global data modalities, such as low-frequency information (via hierarchical latent representation). Whilst these approaches, such as MS-PixelCNN [29] and PixelVAE [11], have had some success in achieving more efficient trade-offs, generation of even moderately sized images still takes on the order of minutes [22]. Secondly, there should exist a practical means by which to efficiently perform single-image compression. Single-image compression then permits parallel coding, which is highly desirable. However, translating a VAE into a lossless codec is currently achieved using the bits-back coding framework (predominantly, bits-back ANS), which requires a large number of initial bits [12, 36, 39] (see Section 3.1). Whilst this is a trivial number of bits on large image datasets (where we can amortize this cost), it renders bits-back an impractical approach for single-image compression. Furthermore, even large datasets are often coded such that images are interlinked. Access to a single image in the middle of a sequence would therefore require all prior images in the bitstream to be additionally decompressed.

To that end, we propose two novelties for use in VAE-based compression designed to address these challenges. The first, our autoregressive sub-pixel convolution, introduces a simple autoregressive factorisation, not dissimilar to the transformations used in normalizing flows [3, 42, 43], designed to present an efficient interpolation between fully-factorised probability distributions and impractical per-pixel autoregressions. Built from a modified space-to-depth convolution operator, we losslessly downsample the data variables before performing a computationally efficient autoregression along the channel dimension. Our autoregressive operator therefore requires a number of network evaluations that is invariant to the data dimensions, with each autoregression crucially performed on a downsampled version of the input tensor. More broadly, we view this framework as a generalisation of many popular autoregressive "context" models used in data compression [11, 26, 31, 41]. Our second contribution, autoregressive initial bits, presents a general framework for avoiding the impracticalities of bits-back ANS, allowing for eminently parallelizable coding. This technique, highly compatible with our autoregressive model, partitions the data variable into two splits such that the second partition is conditionally independent of the latent variable(s), given the first. In this way, we illustrate how the entropy coding of the conditionally independent partition can be used to both supply and remove the initial bits necessitated by bits-back ANS. We demonstrate that this approach reduces the per-image bit overhead by close to 20x.
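To make this coding order concrete, the toy sketch below nests a bits-back step inside the coding of a conditionally independent split, using an exact (unbounded-integer) rANS coder. The sketch is illustrative only: ToyANS and the integer frequency tables are stand-ins for a production entropy coder and the learned model distributions, and the scheme shown is one plausible realisation of the idea, under the assumption that the approximate posterior conditions only on the first split; it is not a description of the paper's exact algorithm.

```python
# Toy demonstration of bits-back coding with a conditionally independent split.
class ToyANS:
    """Last-in-first-out entropy coder operating on an unbounded integer state."""

    def __init__(self, state=1):
        self.state = state

    @staticmethod
    def _lookup(dist, symbol):
        # Cumulative frequency, frequency and total mass of `symbol` in `dist`.
        cum, total = 0, sum(dist.values())
        for s in sorted(dist):
            if s == symbol:
                return cum, dist[s], total
            cum += dist[s]
        raise KeyError(symbol)

    def push(self, symbol, dist):
        cum, freq, total = self._lookup(dist, symbol)
        self.state = (self.state // freq) * total + cum + self.state % freq

    def pop(self, dist):
        total = sum(dist.values())
        r, cum = self.state % total, 0
        for s in sorted(dist):
            if cum + dist[s] > r:
                self.state = dist[s] * (self.state // total) + (r - cum)
                return s
            cum += dist[s]


# Placeholder integer-frequency tables: prior p(z), posterior q(z|x_a),
# likelihoods p(x_a|z) and p(x_b|x_a). Note x_b depends only on x_a, not on z.
p_z = {0: 1, 1: 1, 2: 2}
q_z = lambda x_a: {0: 2, 1: 1, 2: 1}
p_xa = lambda z: {"A": 3, "B": 1} if z == 0 else {"A": 1, "B": 3}
p_xb = lambda x_a: {"u": 3, "v": 1} if x_a == "A" else {"u": 1, "v": 3}


def encode(coder, x_a, x_b):
    coder.push(x_b, p_xb(x_a))  # plain ANS: these bits double as the initial bits
    z = coder.pop(q_z(x_a))     # bits-back: borrow bits just written to draw z ~ q(z|x_a)
    coder.push(x_a, p_xa(z))    # code the first split under the latent-conditioned likelihood
    coder.push(z, p_z)          # code the latent under its prior


def decode(coder):
    z = coder.pop(p_z)
    x_a = coder.pop(p_xa(z))
    coder.push(z, q_z(x_a))     # return the borrowed bits...
    x_b = coder.pop(p_xb(x_a))  # ...and use them to recover the second split
    return x_a, x_b


coder = ToyANS()
pairs = [("A", "v"), ("B", "u"), ("A", "u"), ("B", "v")]
for x_a, x_b in pairs:
    encode(coder, x_a, x_b)
print(coder.state.bit_length(), "bits on the stack")
print([decode(coder) for _ in pairs][::-1] == pairs)  # True: LIFO decode recovers everything
```

Because each push/pop pair is an exact inverse, decoding recovers every split and returns the coder to its initial state; crucially, the first push (the conditionally independent split) supplies the very bits that the bits-back pop then borrows, so no auxiliary initial bitstream is required.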
Finally, we combine the above contributions to present our codec, Split Hierarchical Variational Compression (SHVC). SHVC posits a hierarchical VAE with general-form autoregressive priors that permits parallel coding. Using our framework, we outperform all other VAE-based compression approaches with fewer latent variables and a comparable number of neural network evaluations. We further illustrate the effectiveness of our architecture by training a small model which is able to outperform the similar VAE approach Bit-Swap [18], but with 100x fewer model parameters.

Compression with VAEs can be separated into approaches with [18, 36] and without [22] stochastic posterior sampling (the latter use a discrete distribution of one symbol, assumed to have a probability of one). Whilst obtaining theoretically superior compression ratios, approaches adopting stochastic posteriors, such as HiLLoC [36] and Bit-Swap [18], must entropy code using derivatives of the bits-back argument [12]. These approaches, under the umbrella of bits-back ANS (bb-ANS), require access to an initial bitstream. Although it is possible to amortize the cost of the initial bitstream across a large dataset, single and parallel data compression have presently proven challenging (see Section 3.1). In HiLLoC [36], the authors propose the use of a conventional codec to compress and send part of the dataset, which is then used as the basis for the initial bitstream. Whilst this partially solves some of the coding challenges, it requires the implementation of sub-optimal traditional techniques and still does not permit practical coding of a single image. In contrast, our approach avoids all of these challenges, requiring fewer latent variables, for a negligible additional bit cost. Similarly, approaches such as L3C [22] that leverage deterministic posteriors can also avoid the challenges associated with bb-ANS through the use of arithmetic (or adaptive arithmetic) coding (AC) [40]. The closely related approach RC [24], which can be loosely thought of as a VAE, uses a lossy compressed image as a de facto latent variable to condition the data distribution. However, these techniques pay for their comparative practicality with a penalty in compression ratio, as they must explicitly code the joint distribution of data and latent variables (see Section 3.2). Since this work focuses on VAE-based codecs, we refer readers to [3, 14, 42, 43] and references therein for compression with alternative deep generative models.

Autoregressive Models are a popular means to extend the independence assumptions of fully factorised models to high-dimensional multivariate densities. They are popular both as stand-alone models [31, 38] and in combination with VAEs [11, 25, 29]. In their most computationally expensive forms, such as PixelCNN++ (and variants) [31], they posit a pixel-by-pixel autoregression, which then codes in raster-scan order. Whilst of broad academic interest, their $O(n^2)$ time complexity makes them prohibitive for application in data compression. One proposed solution to this problem has been to parametrise the priors of VAEs with autoregressive densities. Here, probability estimation proceeds by combining hierarchical latent representations with decoded autoregressive context. Supplementing autoregressive components with auxiliary latent features permits causality restrictions that reduce time complexity without greatly diminishing performance.
These restrictions include channel-wise autoregressions [26]; independent, block-based models [29]; "checkerboard" context [41]; and small neural networks [11], amongst others. In fact, these restrictions likely serve a dual purpose: combining powerful autoregressive models with VAEs will likely expedite posterior collapse (see Section 5.3) [5, 11, 20, 28]. Like these techniques, our approach combines a VAE with a restricted autoregressive model. Our method can be thought of as most similar to [26] and [41]. Like the former, we perform a channel-wise autoregression, but do so after our autoregressive operator downsamples the data tensor. As such, our causality more closely resembles that of [41]. However, in contrast to the authors of [41], who enforce their causality with a binary mask, we do so using our sub-pixel convolution. This precipitates a greater degree of parallelism and presents the flexibility to efficiently recover a number of causal dependency schemes, such as PixelCNN++.

Suppose we have access to a dataset of size n, {x_1, x_2, ..., x_n}, drawn from some intractable p_data(x) that we wish to compress. In order to achieve this, we introduce a discrete probability distribution p(x) that, in combination with entropy coding, requires a codelength of $\sum_{i=1}^{n} -\log_2 p(x_i)$ bits to represent the dataset. Ideally, p(x) should closely resemble p_data(x). In such a case, the average codelength in the limit n → ∞ is given by $\mathbb{E}_{p_\text{data}}[-\log_2 p(x)] \to H(x)$, where H(x) is the entropy of the data. Here, the compression scheme is said to be optimal under Shannon's source coding theorem [32]. As discussed in Section 1, variational autoencoders (VAEs) are one popular approach to estimating p_data(x), which define a latent variable model such that

$$p(x) = \int p(x|z)\, p(z)\, dz, \qquad (1)$$

where p(z) is the prior distribution over the latent variable z. As p(x) is normally intractable, VAEs introduce an approximate posterior q(z|x), which is optimised to maximise a lower bound on the marginal evidence, the Evidence Lower Bound (ELBO),

$$\log p(x) \geq \mathbb{E}_{q(z|x)}\big[\log p(x|z) + \log p(z) - \log q(z|x)\big], \qquad (2)$$

where low-variance estimates of the expectation in (2) are obtained via Monte-Carlo integration and the reparametrization trick [17].

Entropy coding requires explicit probabilities of data symbols. However, in VAEs, the model is factorised into a prior and a likelihood, and therefore does not allow direct coding of the data. To remedy this, several authors have proposed coding variants of bb-ANS [18, 35, 36]. This process is outlined as follows; without loss of generality, we describe it for a model with a single latent variable. During compression, one decodes z from some auxiliary initial bits with q(z|x); encodes x with p(x|z); and encodes z with p(z) to obtain the complete bitstream. In the decompression stage, one decodes z from the bitstream with p(z); decodes x with p(x|z); and encodes z with q(z|x), thus returning the initial bits (hence bits-back coding). This technique is visualised in Fig. 1 (left); Fig. 1 also illustrates our alternative, in which a bb-ANS coder is nested inside a block-based autoregressive structure that removes the need for initial bits. The first decoding step common to existing bb-ANS codecs requires access to an initial bitstream. This requirement leads to several disadvantages when compared to AC. Firstly, while the initial bits are returned after decompression, the same bits are occupied and not readily readable beforehand.
Secondly, although we can amortize the cost of the initial bits across a large dataset by chaining the compressed data, access to any given data point then requires decompression of all data points coded after it in the original data sequence. As such, compression of a single image carries substantial overhead, and, by extension, so too do parallel coding implementations.

Within VAE-based lossless compression, however, bits-back coding, and its associated impracticalities, is not the only choice. Indeed, approaches that use "deterministic" posterior sampling [22] may eschew bb-ANS in favour of AC. This approach is almost ubiquitous in the lossless codecs of lossy approaches [2, 7, 25, 26], where eminently parallelizable, low-latency codecs are especially preferable (e.g. for streaming media). We note that when using deterministic posterior sampling, the likelihood associated with the approximate posterior in Eq. (2) is trivially zero, such that the objective resolves to maximum likelihood estimation of the joint distribution (see e.g. [2]). Whilst gaining notable practical coding advantages, in sacrificing a stochastic posterior one also sacrifices the ability to minimize the cost of sending latent variables. To offset this limitation, models will repeatedly downsample (RDS) the number of symbols available to latent representations between each layer. Whilst this limits posterior expressiveness, there is little experimental evidence quantifying how much this matters in practice. In addition, models with stochastic posteriors require large L to excel (where L is the number of latent variables), which hinders run-time. To that end, we display the results of a simple experiment in Fig. 2 designed to investigate this difference further. Here we train three VAEs across CIFAR10, ImageNet32 and ImageNet64: two with stochastic posteriors (one with and one without RDS) and one with a deterministic posterior (with RDS). The architectures of each model are identical (with the exception of downsampling operations), and we quantify compression performance in bits per dimension (BPD). Further experimental details can be found in the Appendix. Here we observe that, even with L ≤ 3, both stochastic posteriors outperform the deterministic posterior by ∼5%. This difference grows as L increases, but whether RDS is used does not matter (at least for L ≤ 5). This result is of important consequence: the best current approach to avoiding the impracticalities of bb-ANS (i.e. using a deterministic posterior) carries a 5% BPD penalty. For single-image compression with stochastic posteriors, the initial bits required would typically be much larger than this. Likewise, unless extending to a deep hierarchy of latent variables, RDS seems like a compute-efficient choice that does not limit performance.

Our method posits a hierarchical VAE where we parameterise the priors using an autoregressive factorisation. We begin by defining a lossless downsampling convolution operator, before describing its application to density estimation using both weak and strong autoregressive models. We then describe how this autoregressive structure can be leveraged to avoid many of the challenges associated with bb-ANS without sacrificing the performance of stochastic posteriors. Finally, we describe how these contributions can be combined to form our SHVC codec.

The space-to-depth and depth-to-space transformations are popular operations across image analysis, from generative modelling [3, 14] to super-resolution [33].
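Before the operator is defined below, the basic shape bookkeeping of these transformations can be checked directly with PyTorch's built-in pixel_unshuffle and pixel_shuffle (a minimal illustration only; the modified, convolutional form used by the method is described next):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)    # a (C, H, W) = (3, 8, 8) image, scale factor k = 2

y = F.pixel_unshuffle(x, 2)    # space-to-depth: channels grow by k*k, spatial dims shrink by k
print(y.shape)                 # torch.Size([1, 12, 4, 4])

x_rec = F.pixel_shuffle(y, 2)  # depth-to-space exactly inverts the rearrangement
print(torch.equal(x, x_rec))   # True: the transformation is lossless

# A channel-wise autoregression over y therefore runs over C*k*k = 12 channels,
# each at the downsampled 4x4 resolution, rather than pixel-by-pixel at 8x8.
```

Because the rearrangement is a bijection, no information is discarded, and any probability model defined over the unshuffled tensor is equivalently a model over the original image.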
These transformations define a pair of mutually inverse operations for efficient up- and downsampling, folding spatial dimensions into channel dimensions and vice versa. Unlike learned operations, they greatly reduce computational complexity, allowing for greater parallelism by losslessly moving computation (and data) into the channel dimension. Indeed, these operations have become an essential component of papers seeking real-time execution (e.g. [10, 19, 30, 33]). Specifically, given a tensor of C channels, height H and width W, we define the space-to-depth and depth-to-space transformations, f and f^{-1}, such that

$$f: \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{Ck^2 \times H/k \times W/k}, \qquad f^{-1}: \mathbb{R}^{Ck^2 \times H/k \times W/k} \to \mathbb{R}^{C \times H \times W},$$

where k is the scale factor. As described in [33], these operations can be efficiently performed using sub-pixel convolutions, which are referred to as pixel unshuffle and pixel shuffle. In particular, the space-to-depth transformation, pixel unshuffle, is performed using a k-stride depthwise convolution in which the n-th of the Ck^2 k×k filters, W_n, has a single non-zero element such that

$$W_n(h, w) = \begin{cases} 1, & (h, w) = \left(\lfloor m/k \rfloor,\; m \bmod k\right), \; m = n \bmod k^2, \\ 0, & \text{otherwise}, \end{cases}$$

where h, w are the indices over the spatial dimensions. The result of this operation is visualised in Fig. 3 (centre). Defining a channel-wise autoregression over the resulting tensor would posit a checkerboard autoregressive structure over each of the channels in the original tensor, sequentially. However, as identified in PixelCNN++ [31], sub-pixels in adjacent channels, sharing the same spatial location in the original tensor, are highly correlated and therefore do not require complex models to describe their dependency structure. As such, the authors of PixelCNN++ use a linear model, predicted by a single network evaluation conditioned on decoded context, to define the joint distribution across channels. In this way, they obviate the need for separate RGB network evaluations. (We note that in our setting, context refers to previously decoded pixels in either the current or a previous hierarchical latent variable.) Henceforth, what we refer to as a weak autoregression is defined similarly to [31] according to $p(x_{c,h,w} \mid x$