Split Hierarchical Variational Compression
Tom Ryder, Chen Zhang, Ning Kang, Shifeng Zhang
2022-04-05

Variational autoencoders (VAEs) have witnessed great success in performing the compression of image datasets. This success, made possible by the bits-back coding framework, has produced competitive compression performance across many benchmarks. Despite this, VAE architectures are currently limited by a combination of coding practicalities and compression ratios. That is, not only do state-of-the-art methods, such as normalizing flows, often demonstrate better performance, but the initial bits required in coding make single and parallel image compression challenging. To remedy this, we introduce Split Hierarchical Variational Compression (SHVC). SHVC introduces two novelties. Firstly, we propose an efficient autoregressive prior, the autoregressive sub-pixel convolution, that allows a generalisation between per-pixel autoregressions and fully factorised probability models. Secondly, we define our coding framework, the autoregressive initial bits, which flexibly supports parallel coding and avoids, for the first time, many of the practical difficulties commonly associated with bits-back coding. In our experiments, we demonstrate SHVC is able to achieve state-of-the-art compression performance across full-resolution lossless image compression tasks, with up to 100x fewer model parameters than competing VAE approaches.

The volume of data, measured in terms of IP traffic, is currently witnessing exponential year-on-year growth [9]. Fuelled by the demand for high-resolution media content, it is estimated that 80% of this data is in the form of images and video [9]. Data service providers, such as cloud and streaming platforms, have consequently seen the costs associated with transmission and storage become prohibitively expensive. For example, increased demand for streaming services forced major providers to throttle the maximum resolution of video content to 720p during the coronavirus pandemic. As such, these challenges have renewed the need for the development of high-performance data compression codecs.

* co-first author. The work of Tom Ryder was conducted during his employment at Huawei Technologies R&D UK.

One solution to this problem has been the development of approaches using likelihood-based generative models capable of discrete density estimation [3, 6, 13, 14, 18, 22, 24, 35, 36, 42, 43]. Such methods operate by learning a deep probabilistic model of the data distribution, which, in combination with entropy coders, can be used to compress data. Here, according to Shannon's source coding theorem [21], the minimal required average codelength is bounded by the expected negative log-likelihood of the data distribution. From this family of generative models, three dominant modes for data compression have emerged: normalizing flows [3, 14, 42, 43], variational autoencoders [18, 24, 36] and autoregressive models [15, 31, 37]. In fact, each of these approaches can be thought of as a traversal of the Pareto frontier of inference speed and compression performance. With broad generality, autoregressive models are often the most powerful but the slowest; variational autoencoders are often the weakest but the fastest; and normalizing flows, depending on the variant, sit somewhere in between.
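As a toy numerical illustration of the codelength bound above (not drawn from the paper's experiments): for a discrete source p_data and a model p, the achievable average codelength under entropy coding is the cross-entropy $\mathbb{E}_{p_\text{data}}[-\log_2 p(x)]$ bits per symbol, which is minimised, at the source entropy H(x), only when p matches p_data.

```python
import math

# Toy four-symbol source and an imperfect model of it.
p_data = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
p_model = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

# Entropy of the source: the best achievable average codelength (bits/symbol).
entropy = -sum(p * math.log2(p) for p in p_data.values())

# Cross-entropy: the average codelength when entropy coding with the model.
cross_entropy = -sum(p_data[s] * math.log2(p_model[s]) for s in p_data)

print(f"H(x)                = {entropy:.3f} bits/symbol")        # 1.750
print(f"E_pdata[-log2 p(x)] = {cross_entropy:.3f} bits/symbol")  # 1.801
```

The gap (here about 0.05 bits per symbol) is the KL divergence between p_data and p, i.e. the rate penalty paid for model mismatch; likelihood-based generative models aim to shrink this gap.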
In this paper, we consider data compression with VAEs, and focus on extending the efficient frontier: obtaining solutions faster than popular VAEs that achieve state-of-the-art compression ratios. The use of VAEs, however, poses two outstanding challenges. Firstly, we should achieve competitive coding ratios without greatly sacrificing time complexity. For example, the best current methods require one of two ingredients to improve performance: building either a deep hierarchy of latent variables [8] or using autoregressive priors [11, 29]. The latter idea, especially popular in the codecs of the lossy compression community [25], posits a model that flexibly learns both local data modalities (via autoregression) and global data modalities, such as low-frequency information (via hierarchical latent representation). Whilst these approaches, such as MS-PixelCNN [29] and PixelVAE [11], have had some success in achieving more efficient trade-offs, generation of even moderately sized images still takes on the order of minutes [22]. Secondly, there should exist a practical means by which to efficiently perform single-image compression. Single-image compression then permits parallel coding, which is highly desirable. However, translating a VAE into a lossless codec is currently achieved using the bits-back coding framework (predominantly, bits-back ANS), which requires a large number of initial bits [12, 36, 39] (see Section 3.1). Whilst this is a trivial number of bits on large image datasets (where we can amortize this cost), it renders bits-back an impractical approach for single-image compression. Furthermore, even large datasets are often coded such that images are interlinked. Access to a single image in the middle of a sequence would therefore require all prior images in the bitstream to be additionally decompressed.

To that end, we propose two novelties for use in VAE-based compression designed to address these challenges. The first, our autoregressive sub-pixel convolution, introduces a simple autoregressive factorisation, not dissimilar to the transformations used in normalizing flows [3, 42, 43], designed to present an efficient interpolation between fully-factorised probability distributions and impractical per-pixel autoregressions. Built from a modified space-to-depth convolution operator, we losslessly downsample the data variables before performing a computationally efficient autoregression along the channel dimension. Our autoregressive operator therefore requires a number of network evaluations that is invariant to the data dimensions, with each autoregression crucially performed on a downsampled version of the input tensor. More broadly, we view this framework as a generalisation of many popular autoregressive "context" models used in data compression [11, 26, 31, 41]. Our second contribution, autoregressive initial bits, presents a general framework for avoiding the impracticalities of bits-back ANS, allowing for eminently parallelizable coding. This technique, highly compatible with our autoregressive model, partitions the data variable into two splits such that the second partition is conditionally independent of the latent variable(s), given the first. In this way, we illustrate how the entropy coding of the conditionally independent partition can be used to both supply and remove the initial bits necessitated by bits-back ANS. We demonstrate that this approach reduces the per-image bit overhead by close to 20x.
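To make this coding order concrete, the toy sketch below nests a bits-back step inside the coding of a conditionally independent split, using an exact (unbounded-integer) rANS coder. The sketch is illustrative only: ToyANS and the integer frequency tables are stand-ins for a production entropy coder and the learned model distributions, and the scheme shown is one plausible realisation of the idea, under the assumption that the approximate posterior conditions only on the first split; it is not a description of the paper's exact algorithm.

```python
# Toy demonstration of bits-back coding with a conditionally independent split.
class ToyANS:
    """Last-in-first-out entropy coder operating on an unbounded integer state."""

    def __init__(self, state=1):
        self.state = state

    @staticmethod
    def _lookup(dist, symbol):
        # Cumulative frequency, frequency and total mass of `symbol` in `dist`.
        cum, total = 0, sum(dist.values())
        for s in sorted(dist):
            if s == symbol:
                return cum, dist[s], total
            cum += dist[s]
        raise KeyError(symbol)

    def push(self, symbol, dist):
        cum, freq, total = self._lookup(dist, symbol)
        self.state = (self.state // freq) * total + cum + self.state % freq

    def pop(self, dist):
        total = sum(dist.values())
        r, cum = self.state % total, 0
        for s in sorted(dist):
            if cum + dist[s] > r:
                self.state = dist[s] * (self.state // total) + (r - cum)
                return s
            cum += dist[s]


# Placeholder integer-frequency tables: prior p(z), posterior q(z|x_a),
# likelihoods p(x_a|z) and p(x_b|x_a). Note x_b depends only on x_a, not on z.
p_z = {0: 1, 1: 1, 2: 2}
q_z = lambda x_a: {0: 2, 1: 1, 2: 1}
p_xa = lambda z: {"A": 3, "B": 1} if z == 0 else {"A": 1, "B": 3}
p_xb = lambda x_a: {"u": 3, "v": 1} if x_a == "A" else {"u": 1, "v": 3}


def encode(coder, x_a, x_b):
    coder.push(x_b, p_xb(x_a))  # plain ANS: these bits double as the initial bits
    z = coder.pop(q_z(x_a))     # bits-back: borrow bits just written to draw z ~ q(z|x_a)
    coder.push(x_a, p_xa(z))    # code the first split under the latent-conditioned likelihood
    coder.push(z, p_z)          # code the latent under its prior


def decode(coder):
    z = coder.pop(p_z)
    x_a = coder.pop(p_xa(z))
    coder.push(z, q_z(x_a))     # return the borrowed bits...
    x_b = coder.pop(p_xb(x_a))  # ...and use them to recover the second split
    return x_a, x_b


coder = ToyANS()
pairs = [("A", "v"), ("B", "u"), ("A", "u"), ("B", "v")]
for x_a, x_b in pairs:
    encode(coder, x_a, x_b)
print(coder.state.bit_length(), "bits on the stack")
print([decode(coder) for _ in pairs][::-1] == pairs)  # True: LIFO decode recovers everything
```

Because each push/pop pair is an exact inverse, decoding recovers every split and returns the coder to its initial state; crucially, the first push (the conditionally independent split) supplies the very bits that the bits-back pop then borrows, so no auxiliary initial bitstream is required.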
Finally, we combine the above contributions to present our codec, Split Hierarchical Variational Compression (SHVC). SHVC posits a hierarchical VAE with general-form autoregressive priors that permits parallel coding. Using our framework, we outperform all other VAE-based compression approaches with fewer latent variables and a comparable number of neural network evaluations. We further illustrate the effectiveness of our architecture by training a small model which is able to outperform the similar VAE approach Bit-Swap [18], but with 100x fewer model parameters.

Compression with VAEs can be separated into approaches with [18, 36] and without [22] stochastic posterior sampling (the latter use a discrete distribution of one symbol, assumed to have a probability of one). Whilst obtaining theoretically superior compression ratios, approaches adopting stochastic posteriors, such as HiLLoC [36] and Bit-Swap [18], must entropy code using derivatives of the bits-back argument [12]. These approaches, under the umbrella of bits-back ANS (bb-ANS), require access to an initial bitstream. Although it is possible to amortize the cost of the initial bitstream across a large dataset, single and parallel data compression have presently proven challenging (see Section 3.1). In HiLLoC [36], the authors propose the use of a conventional codec to compress and send part of the dataset, which is then used as the basis for the initial bitstream. Whilst this partially solves some of the coding challenges, it requires the implementation of sub-optimal traditional techniques and still does not permit practical coding of a single image. In contrast, our approach avoids all of these challenges, requiring fewer latent variables, for a negligible additional bit cost. Similarly, approaches such as L3C [22] that leverage deterministic posteriors can also avoid the challenges associated with bb-ANS through the use of arithmetic (or adaptive arithmetic) coding (AC) [40]. The closely related approach RC [24], which can be loosely thought of as a VAE, uses a lossy compressed image as a de facto latent variable to condition the data distribution. However, these techniques pay for their comparative practicality with a penalty in compression ratio, as they must explicitly code the joint distribution of data and latent variables (see Section 3.2). Since this work focuses on VAE-based codecs, we refer readers to [3, 14, 42, 43] and references therein for compression with alternative deep generative models.

Autoregressive Models are a popular means to extend the independence assumptions of fully factorised models to high-dimensional multivariate densities. They are popular both as stand-alone models [31, 38] and in combination with VAEs [11, 25, 29]. In their most computationally expensive forms, such as PixelCNN++ (and variants) [31], they posit a pixel-by-pixel autoregression, which then codes in raster-scan order. Whilst of broad academic interest, their $O(n^2)$ time complexity makes them prohibitive for application in data compression. One proposed solution to this problem has been to parametrise the priors of VAEs with autoregressive densities. Here, probability estimation proceeds by combining hierarchical latent representations with decoded autoregressive context. Supplementing autoregressive components with auxiliary latent features permits causality restrictions that reduce time complexity without greatly diminishing performance.
These restrictions include channel-wise autoregressions [26]; independent, block-based models [29]; "checkerboard" context [41]; and small neural networks [11], amongst others. In fact, these restrictions likely serve a dual purpose: combining powerful autoregressive models with VAEs will likely expedite posterior collapse (see Section 5.3) [5, 11, 20, 28]. Like these techniques, our approach combines a VAE with a restricted autoregressive model. Our method can be thought of as most similar to [26] and [41]. Like the former, we perform a channel-wise autoregression, but do so after our autoregressive operator downsamples the data tensor. As such, our causality more closely resembles that of [41]. However, in contrast to the authors of [41], who enforce their causality with a binary mask, we do so using our sub-pixel convolution. This precipitates a greater degree of parallelism and presents the flexibility to efficiently recover a number of causal dependency schemes, such as PixelCNN++.

Suppose we have access to a dataset of size n, {x_1, x_2, ..., x_n}, drawn from some intractable p_data(x) that we wish to compress. In order to achieve this, we introduce a discrete probability distribution p(x) that, in combination with entropy coding, requires a codelength of $\sum_{i=1}^{n} -\log_2 p(x_i)$ bits to represent the dataset. Ideally, p(x) should closely resemble p_data(x). In such a case, the average codelength in the limit n → ∞ is given by $\mathbb{E}_{p_\text{data}}[-\log_2 p(x)] \to H(x)$, where H(x) is the entropy of the data. Here, the compression scheme is said to be optimal under Shannon's source coding theorem [32]. As discussed in Section 1, variational autoencoders (VAEs) are one popular approach to estimating p_data(x), which define a latent variable model such that

$$p(x) = \int p(x|z)\, p(z)\, dz, \qquad (1)$$

where p(z) is the prior distribution over the latent variable z. As p(x) is normally intractable, VAEs introduce an approximate posterior q(z|x), which is optimised to maximise a lower bound on the marginal evidence, the Evidence Lower Bound (ELBO),

$$\log p(x) \geq \mathbb{E}_{q(z|x)}\big[\log p(x|z) + \log p(z) - \log q(z|x)\big], \qquad (2)$$

where low-variance estimates of the expectation in (2) are obtained via Monte-Carlo integration and the reparametrization trick [17].

Entropy coding requires explicit probabilities of data symbols. However, in VAEs, the model is factorised into a prior and a likelihood, and therefore does not allow direct coding of the data. To remedy this, several authors have proposed coding variants of bb-ANS [18, 35, 36]. This process is outlined as follows; without loss of generality, we describe it for a model with a single latent variable. During compression, one decodes z from some auxiliary initial bits with q(z|x); encodes x with p(x|z); and encodes z with p(z) to obtain the complete bitstream. In the decompression stage, one decodes z from the bitstream with p(z); decodes x with p(x|z); and encodes z with q(z|x), thus returning the initial bits (hence bits-back coding). This technique is visualised in Fig. 1 (left); Fig. 1 also illustrates our alternative, in which a bb-ANS coder is nested inside a block-based autoregressive structure that removes the need for initial bits. The first decoding step common to existing bb-ANS codecs requires access to an initial bitstream. This requirement leads to several disadvantages when compared to AC. Firstly, while the initial bits are returned after decompression, the same bits are occupied and not readily readable beforehand.
Secondly, although we can amortize the cost of the initial bits across a large dataset by chaining the compressed data, access to any given data point then requires decompression of all data points coded after it in the original data sequence. As such, compression of a single image carries substantial overhead, and, by extension, so too do parallel coding implementations.

Within VAE-based lossless compression, however, bits-back coding, and its associated impracticalities, is not the only choice. Indeed, approaches that use "deterministic" posterior sampling [22] may eschew bb-ANS in favour of AC. This approach is almost ubiquitous in the lossless codecs of lossy approaches [2, 7, 25, 26], where eminently parallelizable, low-latency codecs are especially preferable (e.g. for streaming media). We note that when using deterministic posterior sampling, the likelihood associated with the approximate posterior in Eq. (2) is trivially zero, such that the objective resolves to maximum likelihood estimation of the joint distribution (see e.g. [2]). Whilst gaining notable practical coding advantages, in sacrificing a stochastic posterior one also sacrifices the ability to minimize the cost of sending latent variables. To offset this limitation, models will repeatedly downsample (RDS) the number of symbols available to latent representations between each layer. Whilst this limits posterior expressiveness, there is little experimental evidence quantifying how much this matters in practice. In addition, models with stochastic posteriors require large L to excel (where L is the number of latent variables), which hinders run-time. To that end, we display the results of a simple experiment in Fig. 2 designed to investigate this difference further. Here we train three VAEs across CIFAR10, ImageNet32 and ImageNet64: two with stochastic posteriors (one with and one without RDS) and one with a deterministic posterior (with RDS). The architectures of each model are identical (with the exception of downsampling operations), and we quantify compression performance in bits per dimension (BPD). Further experimental details can be found in the Appendix. Here we observe that, even with L ≤ 3, both stochastic posteriors outperform the deterministic posterior by ∼5%. This difference grows as L increases, but whether RDS is used does not matter (at least for L ≤ 5). This result is of important consequence: the best current approach to avoiding the impracticalities of bb-ANS (i.e. using a deterministic posterior) carries a 5% BPD penalty. For single-image compression with stochastic posteriors, the initial bits required would typically be much larger than this. Likewise, unless extending to a deep hierarchy of latent variables, RDS seems like a compute-efficient choice that does not limit performance.

Our method posits a hierarchical VAE where we parameterise the priors using an autoregressive factorisation. We begin by defining a lossless downsampling convolution operator, before describing its application to density estimation using both weak and strong autoregressive models. We then describe how this autoregressive structure can be leveraged to avoid many of the challenges associated with bb-ANS without sacrificing the performance of stochastic posteriors. Finally, we describe how these contributions can be combined to form our SHVC codec.

The space-to-depth and depth-to-space transformations are popular operations across image analysis, from generative modelling [3, 14] to super-resolution [33].
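Before the operator is defined below, the basic shape bookkeeping of these transformations can be checked directly with PyTorch's built-in pixel_unshuffle and pixel_shuffle (a minimal illustration only; the modified, convolutional form used by the method is described next):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)    # a (C, H, W) = (3, 8, 8) image, scale factor k = 2

y = F.pixel_unshuffle(x, 2)    # space-to-depth: channels grow by k*k, spatial dims shrink by k
print(y.shape)                 # torch.Size([1, 12, 4, 4])

x_rec = F.pixel_shuffle(y, 2)  # depth-to-space exactly inverts the rearrangement
print(torch.equal(x, x_rec))   # True: the transformation is lossless

# A channel-wise autoregression over y therefore runs over C*k*k = 12 channels,
# each at the downsampled 4x4 resolution, rather than pixel-by-pixel at 8x8.
```

Because the rearrangement is a bijection, no information is discarded, and any probability model defined over the unshuffled tensor is equivalently a model over the original image.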
These transformations define a pair of mutually inverse operations for efficient up- and downsampling, folding spatial dimensions into channel dimensions and vice versa. Unlike learned operations, they greatly reduce computational complexity, allowing for greater parallelism by losslessly moving computation (and data) into the channel dimension. Indeed, these operations have become an essential component of papers seeking real-time execution (e.g. [10, 19, 30, 33]). Specifically, given a tensor of C channels, height H and width W, we define the space-to-depth and depth-to-space transformations, f and f^{-1}, such that

$$f: \mathbb{R}^{C \times H \times W} \to \mathbb{R}^{Ck^2 \times H/k \times W/k}, \qquad f^{-1}: \mathbb{R}^{Ck^2 \times H/k \times W/k} \to \mathbb{R}^{C \times H \times W},$$

where k is the scale factor. As described in [33], these operations can be efficiently performed using sub-pixel convolutions, which are referred to as pixel unshuffle and pixel shuffle. In particular, the space-to-depth transformation, pixel unshuffle, is performed using a k-stride depthwise convolution in which the n-th of the Ck^2 k×k filters, W_n, has a single non-zero element such that

$$W_n(h, w) = \begin{cases} 1, & (h, w) = \left(\lfloor m/k \rfloor,\; m \bmod k\right), \; m = n \bmod k^2, \\ 0, & \text{otherwise}, \end{cases}$$

where h, w are the indices over the spatial dimensions. The result of this operation is visualised in Fig. 3 (centre). Defining a channel-wise autoregression over the resulting tensor would posit a checkerboard autoregressive structure over each of the channels in the original tensor, sequentially. However, as identified in PixelCNN++ [31], sub-pixels in adjacent channels, sharing the same spatial location in the original tensor, are highly correlated and therefore do not require complex models to describe their dependency structure. As such, the authors of PixelCNN++ use a linear model, predicted by a single network evaluation conditioned on decoded context, to define the joint distribution across channels. In this way, they obviate the need for separate RGB network evaluations. (We note that in our setting, context refers to previously decoded pixels in either the current or a previous hierarchical latent variable.) Henceforth, what we refer to as a weak autoregression is defined similarly to [31] according to $p(x_{c,h,w} \mid x$