key: cord-0205686-809iyd9c authors: Guo, Qing; Chen, Junya; Wang, Dong; Yang, Yuewei; Deng, Xinwei; Carin, Lawrence; Li, Fan; Tao, Chenyang title: Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization date: 2021-07-02 journal: nan DOI: nan sha: 6057d88e465f316433d81a78827d89a21508744f doc_id: 205686 cord_uid: 809iyd9c

Successful applications of InfoNCE and its variants have popularized the use of contrastive variational mutual information (MI) estimators in machine learning. While featuring superior stability, these estimators crucially depend on costly large-batch training, and they sacrifice bound tightness for variance reduction. To overcome these limitations, we revisit the mathematics of popular variational MI bounds from the lens of unnormalized statistical modeling and convex optimization. Our investigation not only yields a new unified theoretical framework encompassing popular variational MI bounds but also leads to a novel, simple, and powerful contrastive MI estimator named FLO. Theoretically, we show that the FLO estimator is tight, and it provably converges under stochastic gradient descent. Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently. The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.

* Work primarily done before Amazon. 1 Virginia Tech 2 Duke University 3 KAUST 4 Amazon. Correspondence to: Chenyang Tao. Preprint. Under review.

Assessing the dependence between pairs of variables is integral to many scientific and engineering endeavors (Reshef et al., 2011; Shannon, 1948). Mutual information (MI) has been established as a popular metric to quantify generic associations (MacKay, 2003), and its empirical estimators have been widely used in applications such as independent component analysis (Bach & Jordan, 2002), fair learning (Gupta et al., 2021), neuroscience (Palmer et al., 2015), and Bayesian optimization (Kleinegesse & Gutmann, 2020), among others. Notably, recent advances in deep self-supervised learning (SSL) rely heavily on nonparametric MI optimization (Tishby & Zaslavsky, 2015; Oord et al., 2018; He et al., 2020; Chen et al., 2020; Grill et al., 2020). In this study we investigate the likelihood-free variational approximation of MI using only paired samples, which improves the data efficiency of current machine learning practice.

MI estimation has been extensively studied (Battiti, 1994; Maes et al., 1997; MacKay, 2003; Paninski, 2003; Pluim et al., 2003; Torkkola, 2003). While most classical estimators work reasonably well in low-dimensional cases, they scale poorly to big datasets: naïve density-based estimators and k-nearest-neighbor estimators (Kraskov et al., 2004; Pérez-Cruz, 2008; Gao et al., 2015) struggle with high-dimensional inputs, while kernel estimators are slow, memory-demanding, and sensitive to hyperparameters (Gretton et al., 2003; 2005). Moreover, these estimators are usually either non-differentiable or need to hold all data in memory. Consequently, they are not well suited for emerging applications where the data representation needs to be differentiably optimized based on small-batch estimates of MI (Hjelm et al., 2019). Alternatively, one can approach MI estimation through an estimated likelihood ratio (Suzuki et al., 2008; Hjelm et al., 2019), but the associated numerical instability has raised concerns (Arjovsky & Bottou, 2017).
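To make the contrast with the variational approach concrete, a classical non-differentiable baseline can be run in a few lines; the snippet below (ours, not from the paper) uses scikit-learn's Kraskov-style k-nearest-neighbor estimator on a toy Gaussian pair with known ground truth.

```python
# Minimal illustration of a classical, non-differentiable MI estimator:
# scikit-learn's mutual_info_regression wraps a KSG-style KNN estimator.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
rho = 0.8
x = rng.standard_normal(5000)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(5000)

mi_hat = mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=3)[0]
mi_true = -0.5 * np.log(1 - rho**2)  # closed form for bivariate Gaussians
print(f"KNN estimate: {mi_hat:.3f} nats; ground truth: {mi_true:.3f} nats")
```

Estimators of this type return a number but expose no gradient with respect to a learned representation, which is precisely the limitation the variational estimators below address.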
To scale MI estimation to the growing size and complexity of modern datasets, and to accommodate the need for representation optimization (Bengio et al., 2013), variational objectives have recently been widely utilized (Oord et al., 2018). Instead of directly estimating data likelihoods, density ratios, or the corresponding gradients (Wen et al., 2020), variational approaches appeal to mathematical inequalities to construct tractable lower or upper bounds of the mutual information (Poole et al., 2019), facilitated by the use of auxiliary critic functions 1. This practice turns MI estimation into an optimization problem. Prominent examples include the Barber-Agakov (BA) estimator (Barber & Agakov, 2004), the Donsker-Varadhan (DV) estimator (Donsker & Varadhan, 1983), and the Nguyen-Wainwright-Jordan (NWJ) estimator (Nguyen et al., 2010). Notably, these variational estimators closely connect to the variational objectives for likelihood inference (Alemi et al., 2018).

[Figure 1: FLO provides a novel unified framework to analyze contrastive MI bounds and to derive strong-performing new algorithms that provably converge and are more stable than existing counterparts.]

Despite reported successes, a major difficulty with these variational estimators has also been recognized: their estimation variance grows exponentially with the ground-truth MI (McAllester & Stratos, 2018). This is especially harmful to applications involving deep neural nets, as it largely destabilizes training (Song & Ermon, 2020). An effective fix is to leverage multi-sample contrastive estimators, pioneered by the work of InfoNCE (Oord et al., 2018). However, the massive reduction in variance comes at a price: the performance of the InfoNCE estimator is upper bounded by log K, where K is the number of negative samples used (Poole et al., 2019). For a large MI, K needs to be sufficiently large to allow an adequate estimate, which imposes a significant computation and memory burden. While variants of InfoNCE have been motivated to achieve more controllable bias-variance tradeoffs (Poole et al., 2019; Song & Ermon, 2020), little research has been conducted on the cost-benefit aspect of contrastive learning.

A critical insight enabled by InfoNCE is that mutual information closely connects to contrastive learning (Gutmann & Hyvärinen, 2010; Oord et al., 2018). Paralleled by the empirical successes of instance-discrimination-based self-supervision (Mnih & Kavukcuoglu, 2013; Wu et al., 2018; Chen et al., 2020; He et al., 2020) and multi-view supervision (Tian et al., 2019; Radford et al., 2021), InfoNCE offers an InfoMax explanation of why the ability to discriminate naturally paired positive instances from randomly paired negative instances leads to universal performance gains in these applications (Linsker, 1988; Shwartz-Ziv & Tishby, 2017; Poole et al., 2019). Despite these encouraging developments, the big picture of MI optimization and contrastive learning is not yet complete: (i) there is an ongoing debate about to what extent MI optimization helps learning (Tschannen et al., 2020); (ii) it is unclear how the contrastive view reconciles with the non-contrastive MI estimators; (iii) crucially for practical applications, are the empirical tradeoffs made by estimators such as InfoNCE absolutely necessary? And theoretically, (iv) formal guarantees on the statistical convergence of popular variational non-parametric MI estimators are currently missing.
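The log K ceiling mentioned above is easy to see in code: even a critic that separates positives from negatives perfectly cannot push the InfoNCE estimate past log K. The sketch below is our rendering of the standard objective, not the authors' code.

```python
# InfoNCE estimate from a K x K score matrix; diagonal = positive pairs.
# Even a "perfect" critic saturates at log K nats.
import math
import torch

def infonce(scores: torch.Tensor) -> torch.Tensor:
    """scores[i, j] = g(x_i, y_j); positives sit on the diagonal."""
    K = scores.shape[0]
    return scores.diag().mean() - torch.logsumexp(scores, dim=1).mean() + math.log(K)

K = 128
perfect = 50.0 * torch.eye(K)          # positives scored far above negatives
print(infonce(perfect).item())         # ~ 4.85 = log(128): the hard ceiling
```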
In this work we seek to bridge the above gaps by approaching MI estimation from the novel perspective of energy modeling. While this subject has recently been studied extensively using information-theoretic and variational inequalities, we embrace a new view from the lens of unnormalized statistical modeling. Our main contributions are four-fold: (i) unifying popular variational MI bounds under unnormalized statistical modeling; (ii) deriving a simple but powerful novel contrastive variational bound called FLO; (iii) providing theoretical justification of the FLO bound, such as tightness and the first convergence result for variational estimators; and (iv) demonstrating strong empirical evidence of the superiority of the new FLO bound over its predecessors. We also contribute an in-depth discussion bridging the gaps between contrastive learning and MI estimation, along with principled practical guidelines informed by theoretical insights. The importance of MI in data-efficient machine learning is highlighted with novel applications.

This section briefly reviews the mathematical background needed for our subsequent developments. Unnormalized statistical modeling defines a rich class of models of general interest. Specifically, we are interested in problems where the system is characterized by an energy function p̃_θ(x) = exp(−ψ_θ(x)), p̃_θ : X → R_+, where θ denotes the system parameters and ψ_θ(x) is known as the potential function. The goal is to find a solution defined through a normalized version of p̃_θ(x), i.e., min_θ L(p̃_θ/Z(θ)), where L(·) is the loss function, μ is the base measure on X, and Z(θ) ≜ ∫ p̃_θ(x′) dμ(x′) is called the partition function of p̃_θ(x). Problems of this form arise naturally in statistical physics (Reichl, 2016), Bayesian analysis (Berger, 2013), and maximum likelihood estimation (Tao et al., 2019). A major difficulty with unnormalized statistical modeling is that the partition function Z(θ) is generally intractable for complex energy functions 2, and in many applications Z(θ) enters the loss through log Z(θ), whose concavity implies that any finite-sample Monte-Carlo estimate of Z(θ) renders the loss biased. Bypassing the difficulties caused by the intractable partition function is central to unnormalized statistical modeling.

Mutual information and unnormalized statistical models. As a generic score assessing the dependency between two random variables (X, Y), mutual information is formally defined as the Kullback-Leibler (KL) divergence between the joint distribution p(x, y) and the product of the respective marginals p(x)p(y) (Shannon, 1948), i.e., I(X; Y) ≜ E_{p(x,y)}[log (p(x,y)/(p(x)p(y)))]. The integrand log (p(x,y)/(p(x)p(y))) is often known as the point-wise mutual information (PMI) in the literature. Mutual information has a few appealing properties: (i) it is invariant w.r.t. invertible transformations of x and y, and (ii) it has the intuitive interpretation of the reduced uncertainty of one variable given the other 3.

To connect MI to unnormalized statistical modeling, we consider the classical Barber-Agakov (BA) estimator of MI (Barber & Agakov, 2003). To lower-bound MI, BA introduces a variational approximation q(y|x) to the posterior p(y|x); rearranging terms yields the inequality

I(X; Y) ≥ H(Y) + E_{p(x,y)}[log q(y|x)] ≜ I_BA(X; Y|q). (1)

Here we have used I_BA(X; Y|q) to highlight the dependence on q(y|x); when q(y|x) = p(y|x) the bound is sharp. This naïve BA bound is not useful for sample-based MI estimation, as we do not know the ground-truth p(y).
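For completeness, the standard one-line derivation of (1), in our notation, shows where the variational gap comes from:

```latex
\begin{align*}
I(X;Y) &= \mathbb{E}_{p(x,y)}\!\left[\log \tfrac{p(y|x)}{p(y)}\right]
  = \mathbb{E}_{p(x,y)}\!\left[\log \tfrac{q(y|x)}{p(y)}\right]
  + \mathbb{E}_{p(x)}\!\left[\mathrm{KL}\!\left(p(y|x)\,\|\,q(y|x)\right)\right] \\
 &\geq \mathbb{E}_{p(x,y)}[\log q(y|x)] + H(Y) \;=\; I_{\mathrm{BA}}(X;Y|q),
\end{align*}
```

with equality if and only if q(y|x) = p(y|x), since the dropped KL term is nonnegative.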
But we can bypass this by setting q_θ(y|x) = (p(y)/Z_θ(x)) e^{g_θ(x,y)}, where we call e^{g_θ(x,y)} the tilting function and recognize Z_θ(x) = E_{p(y)}[e^{g_θ(x,y)}] as the associated partition function. Substituting this q_θ(y|x) into (1) gives the following unnormalized BA bound (UBA) that pertains to unnormalized statistical modeling (Poole et al., 2019):

I_UBA(X; Y|g_θ) ≜ E_{p(x,y)}[g_θ(x,y)] − E_{p(x)}[log Z_θ(x)]. (2)

This UBA bound is still intractable, but now with Z_θ(x) instead of p(y) we can apply the techniques introduced below to render a tractable surrogate. Please refer to the Appendix for its connections to the popular MI bounds listed in Table 1.

Noise contrastive estimation and InfoNCE. InfoNCE is a multi-sample mutual information estimator proposed in (Oord et al., 2018), built on the idea of noise contrastive estimation (NCE) (Gutmann & Hyvärinen, 2010). With a carefully crafted noise distribution, NCE learns statistical properties of a target distribution by comparing the positive samples from the target distribution to the "negative" samples from the noise distribution, and is thus known as negative sampling in some contexts (Mnih & Kavukcuoglu, 2013; Grover & Leskovec, 2016). The InfoNCE estimator implements this contrastive idea as follows 4:

I_NCE ≜ E_{p_K(x,y)}[ (1/K) Σ_{i=1}^K log ( e^{g(x_i,y_i)} / ((1/K) Σ_{j=1}^K e^{g(x_i,y_j)}) ) ], (3)

where we have used p_K(x,y) to denote K independent draws {(x_k, y_k)}_{k=1}^K from the joint density p(x,y). Here the positive and negative samples are respectively drawn from the joint p(x,y) and the product of marginals p(x)p(y). Intuitively, InfoNCE tries to accurately classify the positive samples when they are mixed with negative samples. The Proposition below formally connects InfoNCE to MI estimation.

Proposition 2.1. InfoNCE is an asymptotically tight mutual information lower bound, i.e., I_NCE ≤ min{I(X; Y), log K} and lim_{K→∞} I_NCE = I(X; Y).

A key idea we exploit is the use of convex duality for MI estimators. Let f(t) be a proper convex, lower-semicontinuous function; then its convex conjugate function is f*(v) ≜ sup_t {⟨t, v⟩ − f(t)} (Hiriart-Urruty & Lemaréchal, 2012). We call f*(v) the Fenchel conjugate of f(t), which is also known as the Legendre transformation in physics, where functions of one quantity (e.g., position, pressure, temperature) are converted into functions of the conjugate quantity (e.g., momentum, volume, entropy, respectively). The Fenchel conjugate pair (f, f*) are dual to each other, in the sense that f(t) = sup_v {⟨t, v⟩ − f*(v)}. In particular, f(t) = −log(t) gives such a pair, which we will exploit in the next section.

This section presents the main result of this paper. To make the derivation more intuitive, and to expose a direct connection to the multi-sample estimators, we show how to obtain our new estimator from InfoNCE. See the Appendix for alternative derivations.

The naïve Fenchel-Legendre Optimization bound. Our key insight is that MI estimation is essentially an unnormalized statistical modeling problem, which can be efficiently handled by the Fenchel-Legendre transform technique. Specifically, the Fenchel-Legendre dual expression of −log(t) is given by

−log(t) = max_u {−u − e^{−u} t + 1}, with the maximum attained at u* = log t. (4)

Now consider the InfoNCE estimator (3). To simplify notation, and to emphasize a contrastive view, we denote c(x, y, y′; g) ≜ g(x, y′) − g(x, y) as the contrastive score 5. This allows us to rewrite InfoNCE as

I_NCE = E_{p_K}[ −log( (1/K) Σ_k e^{c(x_1,y_1,y_k;g)} ) ].

Applying (4) to the above equation gives the following naïve Fenchel-Legendre (NFL) bound with K paired empirical samples (i.e., a batch of {(x_k, y_k)}_{k=1}^K):

I_NFL ≜ E_{p_K}[ −u − e^{−u} (1/K) Σ_k e^{c(x_1,y_1,y_k;g)} + 1 ].

To implement this bound, we model u as a parameterized function u(x_1, y_1, …, y_K).
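As a quick numerical sanity check of identity (4) (ours, not part of the paper), one can scan u on a grid and confirm both the optimal value and the maximizer:

```python
# Verify -log(t) = max_u { -u - exp(-u) * t + 1 }, with maximizer u* = log t.
import numpy as np

t = 2.7
u = np.linspace(-5.0, 5.0, 200001)
dual = -u - np.exp(-u) * t + 1.0
print(dual.max(), -np.log(t))        # both ~ -0.9933
print(u[dual.argmax()], np.log(t))   # maximizer ~ log(t) ~ 0.9933
```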
This bound is unappealing at first sight, for a number of reasons: (i) it is looser than InfoNCE due to the Fenchel inequality; (ii) u is a function of K + 1 input variables, which is cumbersome to compute and does not scale; (iii) the nested inner-loop optimization for u makes training more costly; (iv) the exponentiation of the positive-negative contrast term c(x, y, y′) amplifies the very variance issue we seek to fix. Next we show how to cleverly alleviate these issues by taking K → ∞.

The improved FLO bound. With K → ∞, the summation term converges to the expectation with respect to the marginal density p(y), i.e., lim_{K→∞} (1/K) Σ_k e^{c(x_1,y_1,y_k)} = E_{p(y)}[e^{c(x_1,y_1,y)}] ≜ ζ(x_1, y_1). This reveals an important implication: in the asymptotic limit, u can be drastically simplified to u(x, y), since it only depends on the first input pair. This allows us to define the following single-sample Fenchel-Legendre Optimization (FLO) bound for mutual information estimation:

I_FLO(g, u) ≜ −E_{p(x,y)p(y′)}[u(x, y) + e^{−u(x,y)+c(x,y,y′;g)}] + 1,

where we identify y′ as the negative sample. The pseudocode for FLO is outlined in Algorithm 1.

To better understand FLO, we seek a more intuitive reading of the function u(x, y). Recall from UBA that the optimal critic giving the tight MI estimate is g*(x, y) = log p(x|y) + c(x), where c(x) can be an arbitrary function of x (Ma & Collins, 2018). This g*(x, y) is not directly interpretable; however, fixing (x, y) and integrating out y′ ∼ p(y′) with respect to this g*(x, y) yields the likelihood ratio between the product of marginals and the joint. On the other hand, for a fixed g(x, y), we know u(x, y) attains its optimal value at u*(x, y; g) = log E_{p(y′)}[e^{c(x,y,y′;g)}]; putting these together we have

u*(x, y; g*) = −log (p(x, y)/(p(x)p(y))).

This shows that in FLO, u(x, y) recovers the negative PMI, an appealing self-normalizing property (Gutmann & Hyvärinen, 2010), while the g(x, y) used by competing contrastive MI bounds is not directly meaningful (because of the arbitrary drift term c(x)). On another note, we can further simplify u(x, y) to recover the more compact, yet empirically sub-optimal, TUBA bound defined in Poole et al. (2019) (see Table 1; derivations in the Appendix).

Fast FLO implementations. To maximally encourage parameter sharing, u_φ(x, y) and g_θ(x, y) can be jointly implemented with a single neural network f_Ψ(x, y): X × Y → R² with two output heads, i.e., [u_i, g_i] = f_Ψ(x_i, y_i). Consequently, while FLO leverages a dual-critic design, it does not actually induce extra modeling cost compared with its single-critic counterparts (e.g., InfoNCE). Experiments show this shared parameterization in fact promotes synergies and speeds up learning (see Appendix). To further enhance computational efficiency, we consider the bi-linear critic function that uses all in-batch samples as negatives. In particular, let g_θ(x, y) = τ·⟨h_θ(x), h̃_θ(y)⟩, where h: X → S^p and h̃: Y → S^p are encoders that map data onto the unit sphere S^p embedded in R^{p+1}, ⟨a, b⟩ = aᵀb is the inner product operation, and τ > 0 is the inverse temperature parameter. Thus, for a mini-batch of K paired samples, evaluating the Gram matrix G with G_ij = g_θ(x_i, y_j) can be massively parallelized via matrix multiplication. In this setup, the diagonal terms of G are the positive scores while the off-diagonal terms are the negative scores. A similar design has been widely employed in the contrastive representation learning literature (e.g., Chen et al. (2020)) 6; a minimal sketch of this design is given below.
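The following PyTorch sketch shows one way to implement the FLO objective with the bi-linear critic and in-batch negatives. The module layout and names are our guesses at an implementation, not the authors' released code.

```python
# FLO with a bilinear critic g(x, y) = tau * <h(x), h~(y)> and an MLP head
# for the PMI critic u(x, y); off-diagonal batch entries act as negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FLOCritic(nn.Module):
    def __init__(self, dx: int, dy: int, dim: int = 512, tau: float = 1.0):
        super().__init__()
        self.hx = nn.Sequential(nn.Linear(dx, 512), nn.ReLU(), nn.Linear(512, dim))
        self.hy = nn.Sequential(nn.Linear(dy, 512), nn.ReLU(), nn.Linear(512, dim))
        self.u = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))
        self.tau = tau

    def forward(self, x, y):
        zx = F.normalize(self.hx(x), dim=-1)   # map features onto the unit sphere
        zy = F.normalize(self.hy(y), dim=-1)
        g = self.tau * zx @ zy.t()             # Gram matrix: g[i, j] = g(x_i, y_j)
        u = self.u(torch.cat([zx, zy], dim=-1)).squeeze(-1)  # u(x_i, y_i)
        return g, u

def flo_loss(g: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    """Negative FLO bound: E[u + exp(-u + c) - 1], averaged over in-batch negatives."""
    K = g.shape[0]
    pos = g.diag().unsqueeze(1)                # g(x_i, y_i)
    c = g - pos                                # c(x_i, y_i, y_j) = g(x_i, y_j) - g(x_i, y_i)
    off_diag = ~torch.eye(K, dtype=torch.bool, device=g.device)
    e_neg = torch.exp(-u.unsqueeze(1) + c)[off_diag].reshape(K, K - 1).mean(dim=1)
    return (u + e_neg - 1.0).mean()            # minimizing this maximizes I_FLO
```

Minimizing `flo_loss` jointly over the encoders and the `u` head maximizes the FLO lower bound; the MI estimate is simply the negative of the loss.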
Here we simply model the PMI critic as u(x, y) = MLP([h(x), h̃(y)]), whose computational overhead is almost negligible in practice, since the encoders h, h̃ dominate the computation. Due to space limitations, we elaborate on the connections to existing MI bounds here, and relegate an extended related-work discussion in a broader context to the Appendix.

From log-partition approximation to MI bounds. To embrace a more holistic understanding, we list popular variational MI bounds together with our FLO in Table 1, and visualize their connections in Figure 1. With the exception of JSD, these bounds can be viewed from the perspective of unnormalized statistical modeling, as they differ in how the log-partition function log Z(x) is estimated. We broadly categorize these estimators into two families: the log-family (DV, MINE, InfoNCE) and the exponential-family (NWJ, TUBA, FLO). In the log-family, DV and InfoNCE are multi-sample estimators that leverage direct Monte-Carlo estimates Ẑ of Z(x), and the two differ in whether the positive sample is included in the denominator. To avoid the excessive in-batch computation of the normalizer and the associated memory drain, MINE further employs an exponential moving average (EMA) to aggregate the normalizer across batches. Note that for the log-family estimators, the variational gaps are partly caused by the log-transformation of a finite-sample average, via Jensen's inequality (i.e., E[log Ẑ] ≤ log E[Ẑ] = log Z). In contrast, the objectives of exponential-family estimators do not involve such a log-transformation, since they can all be derived from the Fenchel-Legendre inequality: NWJ directly applies the Fenchel dual of the f-divergence underlying MI (Nowozin et al., 2016), while TUBA exploits this inequality to compute the log-partition log Z(x) = log E_{p(y′)}[exp(g(x, y′))]. Motivated by a contrastive view, our FLO applies the Fenchel-Legendre inequality to the log-partition of contrast scores.

A contrastive view of MI estimation. The MI estimators can also be categorized by how they contrast the samples. For instance, NWJ and TUBA are generally considered non-contrastive estimators, as their objectives do not compare positive samples against negative samples on the same scale (i.e., log versus exp), which might explain their lack of effectiveness in representation learning applications. JSD depends on a two-stage estimation procedure, similar to adversarial training, to assess the MI, explicitly contrasting positive and negative samples to estimate the likelihood ratio; this strategy has been reported to be unstable in many empirical settings. The log-family estimators can be considered a multi-sample, single-stage generalization of JSD. However, the DV objective can grow unbounded, resulting in large variance, and the contrastive signal is decoupled by the EMA operation in MINE. Designed from contrastive perspectives, InfoNCE trades bound tightness for lower estimation variance, which is found to be crucial in representation learning applications. Our FLO formalizes the contrastive view for exponential-family MI estimation and bridges the existing bounds: the PMI normalizer exp(−u(x, y)) is a more principled treatment than the EMA in MINE, and compared with DV, the positive and negative samples are explicitly contrasted and adaptively normalized.
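The Jensen-gap bias of the log-family estimators noted above is easy to see numerically; the small Monte-Carlo check below (ours) shows E[log Ẑ] falling short of log E[Ẑ] = log Z.

```python
# Monte-Carlo illustration of the Jensen gap: E[log Zhat] <= log E[Zhat].
import torch

torch.manual_seed(0)
g = torch.randn(100_000, 64)          # critic scores on 64 negatives per batch
zhat = torch.exp(g).mean(dim=1)       # finite-sample estimate of Z per batch
print(torch.log(zhat).mean().item())  # E[log Zhat]: biased low
print(torch.log(zhat.mean()).item())  # ~ log Z
```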
Important FLO variants. We now demonstrate that FLO is a flexible framework that not only recovers existing bounds but also yields novel ones, such as the Fenchel-Donsker-Varadhan (FDV) estimator derived below. Recall that the optimal u*(x, y) given g_θ(x, y) takes the form −g_θ(x, y) + s_ψ(x); parameterizing u(x, y) in this way recovers the TUBA bound. Additionally, we note that (i) fixing either of u and g and optimizing with respect to the other also gives a valid lower bound on MI; and (ii) a carefully chosen multi-input u({(x_i, y_i)}) can be computationally appealing. As a concrete example, if we set u_φ({(x_i, y_i)}) = log( (1/K) Σ_j e^{c(x_i,y_i,y_j;g_θ)} ) and update this u while artificially keeping the critic g_θ(x, y) fixed 7, then FLO falls back to DV. Alternatively, we can consider the Fenchel dual version: using the same multi-input u_φ({(x_i, y_i)}) above, treat u_φ as fixed and only update g_θ; this gives the novel MI objective in (10), which we call the Fenchel-Donsker-Varadhan (FDV) estimator:

I_FDV ≜ −E[ u_φ + (1/K) Σ_j e^{−u_φ + c(x_i,y_i,y_j;g_θ)} ] + 1, with u_φ = log( (1/K) Σ_j e^{c(x_i,y_i,y_j;g_θ)} ) held fixed (excluded from gradient computation). (10)

Table 1. Comparison of popular variational MI estimators (columns: Objective, Bias, Variance, Convergence). Here g(x, y), u(x, y), and u(x) are variational functions to be optimized, EMA[f(u); η] denotes the exponential moving average of f(u) with decay parameter η ∈ (0, 1), and α ∈ [0, 1] is the balancing parameter used by α-InfoNCE to trade off bias and variance between InfoNCE and TUBA. To highlight the contrastive view, we use (x, y⊕) to denote samples drawn from the joint density p(x, y), and (x, y⊖) to denote samples drawn from the product of marginals p(x)p(y); in context, y⊕ and y⊖ have the intuitive interpretation of positive and negative samples. We exclude variational upper bounds here because their computation typically requires explicit knowledge of conditional likelihoods.

In this section we show the proposed FLO has appealing theoretical properties, making it an attractive candidate for MI estimation and optimization. Based on our analysis from the previous sections, the following results are immediate.

Proposition 2.2. The FLO estimator is tight:

I(X; Y) = max_{g,u} { −E_{p(x,y)p(y′)}[u(x, y) + e^{−u(x,y)+c(x,y,y′;g)}] } + 1. (11)

Corollary 2.3. Let u*(x, y) be the solution of (11); then u*(x, y) = −log (p(x, y)/(p(x)p(y))), i.e., the optimal u recovers the negative PMI.

Gradient and convergence analyses of FLO. To further our understanding of FLO, we examine its parameter gradients, starting from the intractable UBA bound. We want to establish the intuition that our FLO estimator approximates the UBA gradient. Denote E_θ(x, y) ≜ 1/E_{p(y′)}[e^{c(x,y,y′;g_θ)}], so that the gradient of UBA can be written in terms of E_θ(x, y). Since for a fixed g_θ(x, y) the corresponding optimal u*_θ(x, y) maximizing FLO satisfies e^{−u*_θ(x,y)} = E_θ(x, y), the term e^{−u_φ(x,y)} is essentially optimized to approximate E_θ(x, y). To emphasize this point, we now write Ê_θ(x, y) ≜ e^{−u_φ(x,y)}. When this approximation is sufficiently accurate (i.e., E_θ ≈ Ê_θ), ∇I_FLO approximates ∇I_UBA.

We can prove FLO converges under much weaker conditions, even when this approximation û(x, y) is rough. The intuition is as follows: in (18), the ratio Ê_θt/E_θt only rescales the gradient, which implies the optimizer still proceeds in the right direction. The informal version of our result is summarized in the Proposition below (see the Appendix for the formal version and proof).

[Figure: FLO consistently performs best, demonstrating superior strength in learning efficiency and robustness. NWJ takes the runner-up position, but it is more variable and sensitive to network initialization. InfoNCE is less competitive due to sample inefficiency, but its smaller variance helps in the more challenging dynamic case.]
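Before stating the convergence result, it helps to see the optimization scheme it refers to: Algorithm 1 updates (θ, φ) simultaneously with a single stochastic gradient step on the FLO objective. The loop below is our sketch of that scheme, reusing the critic and loss from the earlier snippet; `sample_joint` is a hypothetical paired-data loader.

```python
# Simultaneous SGD on (theta, phi): one optimizer, one loss, no inner loop.
import torch

model = FLOCritic(dx=10, dy=10)                 # from the sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(5000):
    x, y = sample_joint(batch_size=128)         # hypothetical loader for p(x, y)
    g, u = model(x, y)
    loss = flo_loss(g, u)                       # updates g and u jointly
    opt.zero_grad()
    loss.backward()
    opt.step()
```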
Proposition 2.4 (Informal). Assume the ratio Ê_θ(x, y)/E_θ(x, y) is bounded between [a, b] (0 < a < b < ∞); then under the stochastic gradient descent scheme described in Algorithm 1, θ_t converges to a stationary point of I_UBA(g_θ) with probability 1, i.e., lim_{t→∞} ∇I_UBA(g_θt) = 0. Assume additionally that I_UBA is convex with respect to θ; then FLO converges with probability 1 to the global optimum θ* of I_UBA from any initial point θ_0.

Importantly, this marks the first convergence result for variational MI estimators. Convergence analysis for MI estimation is non-trivial, and such results are scarce even for standard statistical estimators (Paninski, 2003; Gao et al., 2015; Rainforth et al., 2018). The lack of convergence guarantees has led to a proliferation of unstable MI estimators used in practice (in particular DV, JSD, and MINE) that rely critically on various empirical hacks to work well (see the discussion in Song & Ermon (2020)). Our work establishes a family of variational MI estimators that provably converge, a contribution we consider significant as it fills an important gap in the current literature on both theoretical and practical fronts.

We consider an extensive range of tasks to validate FLO and benchmark it against state-of-the-art solutions. Limited by space, we present only the key results in the main text, and defer ablation studies and details of our experimental setups to the Appendix. Our code is available from https://github.com/author_name/FLO. All experiments are implemented with PyTorch.

Comparison to baseline MI bounds. We start by comparing FLO to the following popular competing variational estimators: NWJ, TUBA, and InfoNCE. We use the bilinear critic implementation for all models, which maximally encourages both sample efficiency and code simplicity; this strategy also performs best in our observations. We consider the synthetic benchmark from (Poole et al., 2019), where (X ∈ R^d, Y ∈ R^d) is jointly standard Gaussian with diagonal cross-correlation parameterized by ρ ∈ [0, 1). We report d = 10 and ρ ∈ [0, 0.99] here 8, which provides reasonable coverage of the range of MI one may encounter in empirical settings. To focus on the bias-variance trade-off, we plot the decimal quantiles in addition to the estimated MI in Figure 3, where FLO significantly outperforms its variational counterparts in the more challenging high-MI regime. In Figure 7, we show FLO also beats classical MI estimators.

Bayesian optimal experiment design (BOED). We next direct our attention to BOED, a topic of significant shared interest in the statistics and machine learning communities (Chaloner & Verdinelli, 1995; Wu & Hamada, 2011; Hernández-Lobato et al., 2014; Foster et al., 2020). The performance of machine learning models crucially relies on the quality of the data supplied, and BOED is a principled framework that optimizes the data collection procedure (in statistical parlance, conducting experiments) (Foster et al., 2019). Mathematically, let x be the data to be collected, θ the parameters to be inferred, and d the experiment parameters the investigator can manipulate (a.k.a. the design parameters); BOED solves arg max_d I(x; θ | d). We focus on the more generic scenario where explicit likelihoods are not available (Kleinegesse & Gutmann, 2020).
We consider three carefully selected models from the recent literature for their progressive practical significance and the challenges involved (Foster et al., 2021; Ivanova et al., 2021): static designs for (i) a simple linear regression model and (ii) a complex nonlinear pharmacokinetic model for drug development, and dynamic policy design for (iii) epidemic disease surveillance and intervention (e.g., for Covid-19 modeling). In Figure 4 we compare design optimization curves under different MI optimization strategies, where FLO consistently leads. The popular NWJ and InfoNCE estimators make different tradeoffs, and neither matches FLO. We examine the FLO-predicted posteriors and confirm they are consistent with the ground-truth parameters (Figure 7). For the dynamic policy optimization, we also manually inspect the design strategies reported by the different models (Figure 5). Consistent with human judgment, the FLO policy better assigns the budgeted surveillance resources at different stages of pandemic progression.

A novel meta-learning framework. A second application of our work is to meta-learning, an area attracting substantial recent interest. In meta-learning, we are concerned with scenarios in which there are abundant labelled tasks at training time, while upon deployment only a handful of labeled instances are available to adapt the learner to a new task. Briefly, for an arbitrary loss ℓ_t(ŷ, y), where t is the task identifier and ŷ = f(x) is the prediction made by the model, we denote the risk by R_t(f) = E_{p_t(x,y)}[ℓ_t(f(x), y)]. Denote R(f) ≜ E_{t∼p(t)}[R_t(f)] as the expected risk over all tasks and R̂(f) as the mean of the empirical risks computed from all training tasks. Inspired by recent information-theoretic generalization theories for deep learning (Xu & Raginsky, 2017), we derive the novel, principled objective

L_Meta-FLO(f) = R̂(f) + λ·I_FLO(D_t; Ê_t),

where λ is known given the data size and loss function, and (D_t, Ê_t) are respectively the data and task embeddings for the training data. For the first time, this lifts contrastive learning to the level of task and data distributions. Our reasoning is that L_Meta-FLO(f) theoretically bounds R(f) from above, and it is relatively sharp for being data-dependent. We give more information in the Appendix and defer a full exposition to a dedicated paper, owing to its independent interest and the space limits here. Note that other MI bounds are not suitable for this task due to resource and variance concerns. In Figure 6 we show Meta-FLO outperforms the state-of-the-art model-agnostic meta-learning (MAML) model by a wide margin on the regression benchmark from (Finn et al., 2017).

Converging evidence from more applications. In addition to the tasks above, we also examined FLO with extensive ablations and validated its utility on tasks such as representation learning (Table 2) and fair classification. Due to space limits, the results and analyses are delegated to the Appendix. We also point readers to other research inspired by our work, which shows significant boosts in both performance and sample efficiency compared with InfoNCE-based SOTA solutions on ImageNet-scale datasets using the FDV bound (Chen et al., 2021).

We have described a new framework for the contrastive estimation of mutual information from an energy-modeling perspective. Our work not only encapsulates popular variational MI bounds but also inspires novel objectives such as FLO and FDV, which come with strong theoretical guarantees.
In future work, we seek to leverage our theoretical insights to improve practical applications involving MI estimation, such as representation learning and algorithmic fairness, and in particular data-efficient learning.

Proof. We can bound MI from below using a variational distribution q(y|x) as follows:

I(X; Y) = E_{p(x,y)}[log (q(y|x)/p(y))] + E_{p(x)}[KL(p(y|x) ‖ q(y|x))] ≥ E_{p(x,y)}[log (q(y|x)/p(y))] ≜ I_BA. (22)

In sample-based estimation of MI, we do not know the ground-truth marginal density p(y), which makes the above BA bound impractical. However, we can carefully choose an energy-based variational density that "cancels out" p(y):

q_f(y|x) = p(y) e^{f(x,y)} / Z_f(x), Z_f(x) ≜ E_{p(y)}[e^{f(x,y)}].

This auxiliary function f(x, y) is known as the tilting function in the importance-weighting literature. Hereafter, we refer to it as the critic function, in accordance with the nomenclature used in the contrastive learning literature. The partition function Z_f(x) normalizes this q(y|x). Plugging this q_f(y|x) into I_BA yields:

I_UBA = E_{p(x,y)}[f(x, y)] − E_{p(x)}[log Z_f(x)]. (25)

For x, a > 0, we have the inequality log(x) ≤ x/a + log(a) − 1. By setting x ← Z_f(x) and a ← e, we have log(Z_f(x)) ≤ e^{−1} Z_f(x). Plugging this result into (25), we recover the celebrated NWJ bound, which lower-bounds I_UBA:

I_NWJ = E_{p(x,y)}[f(x, y)] − e^{−1} E_{p(x)}[Z_f(x)].

When f(x, y) takes the value f*(x, y) = 1 + log (p(y|x)/p(y)), this bound is sharp.

We next extend these bounds to the multi-sample setting. In this setup, we are given one paired sample (x_1, y_1) from p(x, y) (i.e., the positive sample) and K − 1 samples independently drawn from p(y) (i.e., the negative samples). Note that when we average over x w.r.t. p(x) to compute the MI, this is equivalent to comparing positive pairs from p(x, y) against negative pairs artificially constructed from p(x)p(y). By the independence between X_1 and Y_{k>1}, we have I(X_1; Y_{1:K}) = I(X_1; Y_1) = I(X; Y). So for an arbitrary multi-sample critic f(x; y_{1:K}), the multi-sample NWJ bound still lower-bounds I(X; Y). Now let us set

f̃(x_1; y_{1:K}) = 1 + log ( e^{g(x_1,y_1)} / m(x_1; y_{1:K}) ), m(x_1; y_{1:K}) = (1/K) Σ_k e^{g(x_1,y_k)}. (31)

Due to the symmetry of {y_k}_{k=1}^K, when (x_1, y_{1:K}) are drawn from the product of marginals we have E[ e^{g(x_1,y_1)} / m(x_1; y_{1:K}) ] = E[ (1/K) Σ_k e^{g(x_1,y_k)} / m(x_1; y_{1:K}) ] = 1. So the exponential correction term in the NWJ bound contributes exactly 1, and plugging f̃ into the bound gives

I(X; Y) ≥ E_{p_K}[ log ( e^{g(x_1,y_1)} / m(x_1; y_{1:K}) ) ] = I_NCE,

which proves the bound. Now we need to show this bound is sharp when K → ∞. Recall the optimal critic is given by the density ratio f*(x, y) = p(y|x)/p(y); as K → ∞, m(x_1; y_{1:K}) converges to E_{p(y)}[e^{g(x_1,y)}] and the bound becomes tight at this optimum. This concludes our proof.

Proof. Equation (17) is a direct consequence of applying the Fenchel duality trick to the UBA bound. We already know UBA is sharp when g*(x, y) = log p(x|y) + c(x), and the Fenchel duality holds with equality when u*(x, y; g) = log E_{p(y′)}[exp(g(x, y′) − g(x, y))]. So Equation (17) holds with (g, u) = (g*, u*(g*)). We can also see this from equation (14).

Proof. This is immediate from equation (14).

While the above relation shows we can use FLO to amortize the learning of UBA, one major caveat with the above formulation is that û(x, y) has to be very accurate for it to be valid. As such, one would need to solve a cumbersome nested optimization problem: update g_θ, then optimize u_φ until it converges before the next update of g_θ. Fortunately, we can show that this is unnecessary: convergence can be established under much weaker conditions, which justifies the use of simple simultaneous stochastic gradient descent on both (θ, φ) in the optimization of FLO. Our proof is based on the convergence analyses of generalized stochastic gradient descent from (Tao et al., 2019).
We cite the main assumptions and results below for completeness.

Definition E.1 (Generalized SGD, Problem 2.1 of Tao et al. (2019)). Let h(θ; ω), ω ∼ p(ω), be an unbiased stochastic gradient estimator for the objective f(θ), let {η_t > 0} be the fixed learning-rate schedule, and let {ξ_t > 0} be random perturbations to the learning rate. We want to solve ∇f(θ) = 0 with the iterative scheme θ_{t+1} = θ_t + η̃_t h(θ_t; ω_t), where the {ω_t} are i.i.d. draws and η̃_t = η_t ξ_t is the randomized learning rate.

Assumption E.2 (Standard regularity conditions for Robbins-Monro stochastic approximation, Assumption D.1 of Tao et al. (2019)). A2. The ODE θ̇ = h̄(θ), with h̄(θ) ≜ E_ω[h(θ; ω)], has a unique equilibrium point θ*, which is globally asymptotically stable; A3. the sequence {θ_t} is bounded with probability 1; A4. the noise sequence {ω_t} is a martingale difference sequence; A5. for some finite constants A and B and some norm ‖·‖ on R^d, E[‖ω_t‖²] ≤ A + B‖θ_t‖² a.s. for all t ≥ 1.

Proposition E.3 (Generalized stochastic approximation, Proposition 2.2 of Tao et al. (2019)). Under the standard regularity conditions listed in Assumption E.2, further assume Σ_t E[η̃_t] = ∞ and Σ_t E[η̃_t²] < ∞. Then θ_t → θ* with probability 1 from any initial point θ_0.

Assumption E.4 (Weaker regularity conditions for generalized Robbins-Monro stochastic approximation, Assumption G.1 of Tao et al. (2019)). B1. The objective function f(θ) is second-order differentiable; B2. f(θ) has a Lipschitz-continuous gradient, i.e., there exists a constant L satisfying −LI ⪯ ∇²f(θ) ⪯ LI; B3. the noise has bounded variance, i.e., there exists a constant σ² such that E[‖h(θ; ω) − ∇f(θ)‖²] ≤ σ².

Proposition E.5 (Weaker convergence results, Proposition G.2 of Tao et al. (2019)). Under the technical conditions listed in Assumption E.4, the SGD solution {θ_t}_{t>0} updated with a generalized Robbins-Monro sequence (η̃_t: Σ_t E[η̃_t] = ∞ and Σ_t E[η̃_t²] < ∞) converges to a stationary point of f(θ) with probability 1.

First, we validate the properties and utility of the proposed FLO estimator by comparing it to competing solutions on Gaussian toy models, so that we also have the ground-truth MI for reference. We choose TUBA, NWJ, InfoNCE, and α-InfoNCE as our baselines. We set α = 0.8 for α-InfoNCE because, among all other choices, it best visualizes the bias-variance trade-offs relative to InfoNCE. NWJ and InfoNCE are the two most popular estimators employed in practice without additional hacks. TUBA is included for its close relevance to FLO (i.e., optimizing u(x) instead of u(x, y), and being non-contrastive). We do not include DV here because we find DV needs an excessively large negative sample size K to work. Variants like MINE are excluded for involving additional tuning parameters or hacks, which complicates the analyses. The proposed FDV estimator is also excluded from our bound-comparison analyses since it includes Î_DV in the estimator. Note that although not suitable for MI estimation, we find FDV works quite well in representation learning settings where the optimization of MI is the target. This is because in FDV the primal term Î_DV does not participate in gradient computation, so it does not suffer the degenerate performance of DV.

We use the following baseline setup for all models unless otherwise specified. For the critic functions g(x, y), u(x, y), and u(x), we use a multi-layer perceptron (MLP) construction with 512 × 512 hidden layers and ReLU activations. For the optimizer, we use Adam with learning rate 10^{-4}. A default batch size of 128 is used for training. To report the estimated MI, we use 10k samples and take the average.
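Rendered as code (our rendering of the spec above, plus the closed-form reference MI for the Gaussian toy model), the baseline setup amounts to:

```python
# 512x512 ReLU MLP critics and the ground-truth MI of the Gaussian benchmark.
import numpy as np
import torch.nn as nn

def make_critic(in_dim: int, out_dim: int = 1) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(in_dim, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, out_dim),
    )

g_net = make_critic(in_dim=2 * 10)   # g(x, y) on concatenated inputs, d = 10
u_net = make_critic(in_dim=2 * 10)   # u(x, y)

def true_mi(rho: float, d: int) -> float:
    """MI (nats) of jointly Gaussian (X, Y) with per-dimension correlation rho."""
    return -0.5 * d * np.log(1.0 - rho**2)

print(true_mi(0.9, 10))   # ~ 8.30 nats: the high-MI regime of the benchmark
```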
To visualize variance, we plot the decimal quantiles at {10%, 20%, …, 80%, 90%} and color-code them with different shades. We sample fresh data points in each iteration to avoid overfitting the data. All models are trained for ∼5,000 iterations (each epoch samples 10k new data points, i.e., 78 iterations per epoch for a total of 50 epochs). For Figure 3, we use the 2-D Gaussian with ρ = 0.5, and the contour plot is obtained with a grid resolution of 2.5 × 10^{-2}. For the shared-parameterization experiment for FLO (Figure 5), we also used the more challenging 20-D Gaussian with ρ = 0.5, and trained the network with learning rates 10^{-3} and 10^{-4}, respectively. We repeat each experiment 10 times and plot the distribution of the MI-estimation trajectories. The MLP architecture we use was unable to produce a sharp estimate in this setting (the same holds for the other estimators), but FLO with a shared network learns faster than its separate-network counterpart, validating the superior efficiency of parameter sharing.

We set up the bi-linear critic experiment as follows. For the baseline FLO, we use the shared-network architecture for g(x, y) and u(x, y), and use in-batch shuffling to create the desired number of negative samples (FLO-shuff). For FLO-BiL, we adopt the following implementation: the feature encoders h(x), h̃(y) are each modeled with a three-layer MLP with 512-unit hidden layers and ReLU activations, and we set the output dimension to 512. We then concatenate the feature representations into z = [h(x), h̃(y)] and feed z to the u(x, y) network, a two-layer 128-unit MLP. Note this is merely a convenient modeling choice and can be further optimized for efficiency. Each epoch contains 10k samples, and FLO-shuff is trained with a fixed batch size. FLO-BiL is trained with the batch size set to the desired negative sample size, because all in-batch data serve as negatives. We use the same learning rate 10^{-4} in both cases, which puts large-batch training at a disadvantage, as fewer iterations are executed. To compensate, we use T(K) = (K/K_0)^{1/2} · T_0 to set the total number of iterations for FLO-BiL, where (T_0, K_0) are respectively the baseline training iterations and negative sample size used by FLO-shuff, and the negative sample sizes K are {10, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500}. We are mostly interested in computational efficiency here, so we do not compare the bounds. In Figure 6, we see the cost of training FLO-shuff grows linearly, as expected. For FLO-BiL, a U-shaped cost curve is observed. This is because the network is roughly 2× more complicated in the bi-linear implementation (i.e., two more MLPs), and using a larger batch size actually increases overall efficiency until device capacity is reached, which explains the initial drop in cost, followed by the anticipated square-root growth.

Cross-view representation learning experiment. In this experiment, we use the training split (60k samples) to train the cross-view representation and the prediction model based on the concatenated cross-view features. For CCA extraction, we use the scikit-learn cross_decomposition.CCA implementation with default settings. For all other MI-based solutions, we use the multi-sample estimators and adopt the bi-linear critic implementation, as described in the bi-linear experiment above, for maximal efficiency.
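For reference, the in-batch shuffling used by the FLO-shuff baseline above amounts to breaking the pairing by permuting the y-side of the batch (our one-line rendering); repeating the permutation yields additional negatives.

```python
# One round of in-batch shuffling: y[perm] pairs each x with a random y.
import torch

def shuffle_negatives(y: torch.Tensor) -> torch.Tensor:
    return y[torch.randperm(y.shape[0], device=y.device)]
```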
The prediction model is a three-layer MLP of our standard configuration, trained on the extracted cross-view features for 50 epochs with learning rate 10^{-3}.

Comparison of shared and separate parameterizations for (g_θ, u_φ). Single-network parameterization not only runs faster but also learns more efficiently. In Figure 4, all methods are applied to a 20-D Gaussian (i.e., 10 independent pairs of correlated x and y dimensions). We repeat in-batch shuffling K times over the y-dimensions to obtain the desired number of negative samples.

Comprehensive analyses of bias-variance trade-offs. To supplement the results in the main paper, here we provide additional bias-variance plots for different MI estimators under various settings. In Figure 2 we show the bias-variance plot of MI estimates for 2-D Gaussians. In this case, the networks used are sufficiently expressive, so a sharp estimate is attainable. In all cases the estimation variance grows with the MI value, which is consistent with the theoretical prediction that for tight estimators, the estimation variance grows exponentially with MI (McAllester & Stratos, 2018). In such cases, the argument for InfoNCE's low-variance profile no longer holds: it actually performs sub-optimally. For complex real applications, the negative sample size used might not provide an adequate estimate of the ground-truth MI (i.e., the log K cap), and that is when InfoNCE's low-variance profile actually helps. We also notice that when the MI estimate is not exactly tight but very close to the true value, the variance drops considerably. This might provide an alternative explanation (and opportunity) for the development of near-optimal MI estimation theories, which is not covered in the existing literature.

We also compared the single-sample estimators for NWJ, TUBA, and FLO to their multi-sample InfoNCE-based counterparts (Figure 3), which is the comparison made in some prior studies. In this setting, the single-sample estimators' variances are considerably larger, which explains their less favorable performance. Note that, contrary to theoretical predictions, a larger negative sample size does make NWJ, TUBA, and FLO empirically tighter, although the gains are much smaller compared to those of InfoNCE (partly because these three estimators are already fairly tight relative to InfoNCE). This might be explained by a better optimization landscape due to reduced estimation variance. We conjecture that for multi-sample NWJ, TUBA, and FLO, the performance in empirical applications such as self-supervised learning should be competitive with that of InfoNCE, which has never been reported in the literature.

Network capacity and MI estimation. We further investigate how the learning capacity of the neural network affects MI estimation. In Figure 4 we compare the training dynamics of the FLO estimator with L-layer neural networks, where L ∈ {2, 3, 6} and each hidden layer has 512 units. A deeper network is generally considered more expressive. We observe that larger networks generally converge faster in terms of training iterations and also obtain better MI estimates. However, more complex networks imply more computation per iteration, and they can be less stable when trained with larger learning rates. In addition to the results reported in the paper, we investigate how the latent dimension affects the results of cross-view representation learning.
We vary the latent dimension from d = 2 to d = 20, and plot the label-prediction accuracy of the corresponding latent representations in Figure 5. The same setup as the bi-linear experiment is used for MI estimation (for all MI estimators), where the images are flattened before being fed to the MLPs. The representations are trained for 50 epochs and the prediction model is trained for 50 epochs. We also trained the models for another 50 epochs, and the conclusions are similar. We see that FDV works well for lower dimensions (e.g., d ≈ 5), while FLO and InfoNCE work better for higher dimensions (d > 10).

We also compare our FLO estimator to classical MI estimators in Figure 6. The following implementations of baseline estimators for multi-dimensional data are considered: (i) KDE: we use kernel density estimators to approximate the joint and marginal likelihoods, then compute MI by definition; (ii) NPEET 9, a variant of Kraskov's k-nearest-neighbour (KNN) estimator (Kraskov et al., 2004; Ver Steeg & Galstyan, 2013); (iii) KNNIE 10, the original KNN estimator and its revised variant (Gao et al., 2018). These models are tested on 2-D and 20-D Gaussians with varying strengths of correlation, with their hyper-parameters tuned for best performance. Note that the notion of "best fit" is somewhat subjective, as we fix the hyper-parameter across all dependency strengths, and what works better for weak dependency might not necessarily work well for strong dependency. We choose the parameter whose result is visually most compelling. In addition to the above, we also considered other estimators such as maximum-likelihood density ratio 11 (Suzuki et al., 2008) and KNN with local non-uniformity correction 12. However, these models either do not have a publicly available multi-dimensional implementation, or their code does not produce reasonable results 13.

I. Regression with Sensitive Attributes (Fair Learning) Experiments

I.1. Introduction to fair machine learning

Nowadays, consequential decisions impacting people's lives are increasingly made by machine learning models. Examples include loan approval, school admission, and advertising campaigns, among others. While automated decision making has greatly simplified our lives, concerns have been raised about (inadvertently) echoing, or even amplifying, societal biases. Specifically, algorithms are vulnerable to inheriting discrimination present in the training data and passing such prejudices on in their predictions. To address the growing need to mitigate algorithmic biases, research has been devoted to this direction under the name of fair machine learning. While discrimination can take many definitions that are not necessarily compatible, in this study we focus on the most widely recognized criterion, Demographic Parity (DP), defined below.

Definition I.1 (Demographic Parity (Dwork et al., 2012)). The absolute difference between the selection rates of a decision rule ŷ for two demographic groups defined by a sensitive attribute s, i.e., DP(Ŷ, S) = |P(Ŷ = 1|S = 1) − P(Ŷ = 1|S = 0)|. With multiple demographic groups, it is the maximal disparity between any two groups: DP(Ŷ, S) = max_{s,s′} |P(Ŷ = 1|S = s) − P(Ŷ = 1|S = s′)|.

To scrub the sensitive information from the data, we consider the in-processing setup L = Loss(Predictor(Encoder(x_i)), y_i) + λ·∆(ŷ, s), where the first term is the primary loss and ∆(ŷ, s) measures the violation of the specified fairness metric. By regularizing model training with the fairness violation, fairness is enforced during model training.
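A minimal rendering (ours) of the DP metric in Definition I.1 for a binary sensitive attribute:

```python
# Demographic parity violation from hard predictions y_hat and group labels s.
import numpy as np

def demographic_parity(y_hat: np.ndarray, s: np.ndarray) -> float:
    rate1 = y_hat[s == 1].mean()   # selection rate of group S = 1
    rate0 = y_hat[s == 0].mean()   # selection rate of group S = 0
    return abs(rate1 - rate0)
```

The multi-group version simply takes the maximum pairwise disparity over all group pairs.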
In practice, it is recognized that appealing to fairness sometimes costs the utility of an algorithm (e.g., prediction accuracy) (Hardt et al., 2016). So most applications seek their own sweet spot on the fairness-utility curve; in our example, it is the DP-error curve. A fair-learning algorithm is considered good if it achieves lower error at the same level of DP control. In this experiment, we compare our MI-based fair-learning solutions to state-of-the-art methods. Adversarial debiasing tries to maximize the prediction accuracy for the target label while minimizing the prediction accuracy for the sensitive group ID. We use the implementation from the AIF360 14 package (Bellamy et al., 2018). FERMI is a density-based estimator of the exponential Rényi mutual information ERMI ≜ E_{p(x,y)}[p(x,y)/(p(x)p(y))], and we use the official codebase. For evaluation, we consider the Adult dataset from the UCI data repository (Asuncion & Newman, 2007), which is 1994 census data with 30k samples in the train set and 15k samples in the test set. The target task is to predict whether income exceeds $50k, with gender used as the protected attribute. Note that we use this binary-sensitive-attribute dataset just to demonstrate that our solution is competitive with existing solutions, which were mostly developed for binary sensitive groups. Our solution extends to more general settings where the sensitive attribute is continuous and high-dimensional.

11 https://github.com/leomuckley/maximum-likelihood-mutual-information 12 https://github.com/BiuBiuBiLL/NPEET_LNC 13 These are third-party Python implementations, so bugs are highly likely. 14 https://github.com/Trusted-AI/AIF360

We implement our fair regression model as follows. To embrace data uncertainty, we consider the latent variable model p_θ(y, x, z) = p_θ(y|z) p_θ(x|z) p(z), where v = {x, y} are the observed predictors and labels. Under the variational inference framework (Kingma & Welling, 2014), we write the ELBO(v; p_θ(v, z), q_φ(z|v)) as

ELBO = E_{q_φ(z|v)}[log p_θ(y|z)] − β·KL(q_φ(z|v) ‖ p(z)),

where p(z) is modeled as a standard Gaussian, the approximate posterior q_φ(z|v) is modeled by a neural network parameterizing the mean and variance of the latents (we use the standard mean-field approximation, so the cross-covariance is set to zero), and β is a hyperparameter controlling the relative contribution of the KL term to the objective. Note that, unlike the standard ELBO, we have dropped the term E_{Z∼q_φ(z|v)}[log p_θ(x|Z)] because we are not interested in modeling the covariates. Note this coincides with the variational information bottleneck (VIB) formulation (Alemi et al., 2016). Additionally, the posterior q_φ(z|v) is conditioned only on x, not on y, because in practice the labels y are not available at inference time. All networks used here are standard three-layer MLPs with 512 hidden units.

For Figure 7, we note that adversarial de-biasing actually crashed in the DP range [0.1, 0.18], so those results had to be removed. Since interpolation is used to connect the data points, this makes the adversarial scheme look good in this DP range, which is not the case. FERMI also gave unstable estimates in the DP range [0.1, 0.18]. Among the MI-based solutions, NWJ was the most unstable. Performance-wise, InfoNCE, TUBA, and FDV are mostly tied, with the latter two slightly better for the "more fair" solutions (i.e., at the low-DP end). We further consider the self-supervised learning setup. In particular, we use MNIST data and a ResNet-10 network for feature extraction.
To optimize the representation without labels, we maximize the MI between two views of the digits: randomly rotated (0-30 degrees) and resized & cropped (scale 0.5-1.0). We train with learning rate 10^{-4} for 50 epochs, and report performance by training a linear classifier on the learned representation. See Table 1 for results. In this experiment, FLO worked best, followed by InfoNCE.

We further compared different contrastive representation learning models on the Cifar10 dataset. We compare our FLO and FDV to InfoNCE and the recently proposed SpectralNCE (HaoChen et al., 2021). We follow the setup of the SimCLR paper and optimize the contrastive loss across two randomly augmented views for 200 epochs. To evaluate, we compute the InfoNCE loss with K = 2,560 using the critic learned by each objective. Note that SpectralNCE is optimized for linear representation learning and does not directly target mutual information. FDV learns the best MI, followed by FLO.

Our setup is the same as the Noisy Linear Model in (Kleinegesse & Gutmann, 2020). We use 10 individual experimental designs. For the θ-encoder and y-encoder, we use a 2-layer MLP with 128-dim hidden layers, and set the feature dimension to 512. We train the models for 5000 epochs with a batch size of 64 and a learning rate of 2 × 10^{-5}. Four MI estimators (NWJ, TUBA, InfoNCE, and FLO) are compared in this experiment, yielding four optimized designs. We then use MCMC to estimate the posterior of the parameters.

The settings of this experiment follow the Pharmacokinetic Model of (Kleinegesse & Gutmann, 2020). We use 10 individual experimental designs. The MLP has 2 layers with 128-dim hidden layers, and the output feature dimension is set to 512. We train for 10000 epochs with learning rate 10^{-5} using the four methods (NWJ, TUBA, InfoNCE, FLO).

We here consider the spread of a disease within a population of N individuals, modelled by stochastic versions of the well-known SIR model (Allen et al., 2008). Individuals begin in a susceptible state S(t) and can then move to an infectious state I(t) with an infection rate of β. These infectious individuals then move to a recovered state R(t) with a recovery rate of γ, after which they can no longer be infected. The SIR model, governed by the state changes S(t) → I(t) → R(t), thus has two model parameters θ_1 = (β, γ). The stochastic versions of these epidemiological processes are usually defined by a continuous-time Markov chain (CTMC), from which we can sample via the Gillespie algorithm (Allen, 2017). However, this generally yields discrete population states that have undefined gradients. In order to test our gradient-based algorithm, we thus resort to an alternative simulation algorithm that uses stochastic differential equations (SDEs), where gradients can be approximated. We first define the population vectors X_1(t) = (S(t), I(t)) for the SIR model and X_2(t) = (S(t), E(t), I(t)) for the SEIR model. We can effectively ignore the recovered population because the total population is fixed. The system of Itô SDEs for the above epidemiological processes is dX(t) = f(X(t)) dt + G(X(t)) dW(t), where f is the drift term, G is the diffusion term, and W is the Wiener process. The Euler-Maruyama algorithm is used to simulate sample paths of the above SDEs. We use an infection rate of β = 0.1 and a recovery rate of γ = 0.01. The independent priors are N(0.1, 0.02) and N(0.01, 0.002). The initial number of infected individuals is 10.
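An Euler-Maruyama sketch of the SIR SDE above, in chemical-Langevin form with infection and recovery events; β and γ follow the text, while the population size, horizon, and step size are our illustrative choices.

```python
# Simulate (S(t), I(t)) paths of the SIR SDE dX = f(X) dt + G(X) dW.
import numpy as np

def simulate_sir(beta=0.1, gamma=0.01, N=500, I0=10, T=100.0, dt=0.01, rng=None):
    rng = rng or np.random.default_rng()
    S, I = float(N - I0), float(I0)
    path = [(S, I)]
    for _ in range(int(T / dt)):
        a_inf = beta * S * I / N      # infection event rate (S -> I)
        a_rec = gamma * I             # recovery event rate (I -> R)
        dW = rng.standard_normal(2) * np.sqrt(dt)
        S += -a_inf * dt - np.sqrt(max(a_inf, 0.0)) * dW[0]
        I += (a_inf - a_rec) * dt + np.sqrt(max(a_inf, 0.0)) * dW[0] \
             - np.sqrt(max(a_rec, 0.0)) * dW[1]
        S, I = max(S, 0.0), max(I, 0.0)   # clip: populations stay non-negative
        path.append((S, I))
    return np.array(path)
```

Unlike Gillespie sampling of the CTMC, every operation here is differentiable in (β, γ) up to the clipping, which is what the gradient-based design algorithm requires.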
We update the MI estimator once after every three sampler updates. We use an RNN with 2 layers of 64-dim hidden units to decode the sequential design.

Model parameterization. Now we show how different parameterization schemes affect the performance and learning efficiency of FLO. In Figure 4 (main paper), we visualize the learning dynamics of FLO using a shared network for (g_θ, u_φ) and using two separate networks. Parameter sharing not only cuts computation, it also helps the model learn faster. There is no discernible difference in final performance, and FLO-separate used twice as many iterations to converge. Next, we compare the bi-linear critic implementation (FLO-BiL) to the standard MLP with paired inputs (x, y) (Figure 10). In the bi-linear case K is tied to the batch size, so we scale the computation budget with T(K) = (K/K_0)^{1/2} · T_0, where (T_0, K_0) are respectively the baseline training iterations and negative sample size used by FLO-MLP. We see FLO-BiL drastically reduces computation (Figure 10). In Figure 9, we show the learning process of each estimator for ρ = 0.9. FLO achieves the best estimate; InfoNCE is more stable but saturates easily.

Intuitions. Now let us describe the new Meta-FLO model for meta-learning. Given a model space M and a loss function ℓ: M × Z → R, the true risk and the empirical risk of f ∈ M are respectively defined as R_t(f) ≜ E_{Z∼μ_t}[ℓ(f, Z)] and R̂_t(f; S_t) ≜ (1/m) Σ_{i=1}^m ℓ(f, Z_i). Let R_τ denote the generalization error for the task distribution τ from which all tasks originate, and R̂_τ its empirical estimate. Our heuristic is simple: optimize a tractable upper bound of the generalization risk. For meta-learning, we sample n tasks for training and n′ tasks for testing, respectively denoted S_{1:n} and S^{test}_{1:n′}. We further decouple the learning algorithm into two parts: the meta-learner A_meta(S_{1:n}), which consumes all training data to produce the meta-model f_meta, and the task-adaptation learner A_adapt(f_meta, S_t), which adapts the meta-model to the individual task data S_t to obtain the task model f_t. For parameterized models such as deep nets, we denote Θ as our meta-parameters and E_t as task-parameters, that is, Θ ≜ A_meta(S_{1:n}) and E_t ≜ A_adapt(Θ, S_t), where Θ and E_t can be understood as weights of deep nets. In subsequent discussions, we also call E_t the task embedding. We define the population meta-risk as R_τ(Θ) ≜ E_{t, Θ=A_meta(S_{1:n})}[E_{E_t=A_adapt(Θ, S_t)}[R_t(f_{E_t})]], and similarly the empirical risk R̂_τ evaluated on the query set Q_t. Our model is based on the following inequality (Anonymous, 2022): R_τ(Θ) ≤ R̂_τ(Θ) + λ·I(D_t; Ê_t), which gives the main objective L_Meta-FLO(f) = R̂(f) + λ·I_FLO(D_t; Ê_t). We summarize our model architecture in Figure 8.

References

Fixing a broken ELBO.
Deep variational information bottleneck.
A primer on stochastic epidemic models: Formulation, numerical simulation, and analysis.
Anonymous. Meta-FLO: Principled simple fast few-shot learning with stochastic prompt encoding networks.
Towards principled methods for training generative adversarial networks.
UCI machine learning repository.
Kernel independent component analysis.
The IM algorithm: a variational approach to information maximization.
Using mutual information for selecting features in supervised neural net learning.
Mutual information neural estimation.
AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias.
Representation learning: A review and new perspectives.
Statistical decision theory and Bayesian analysis.
Bayesian experimental design: A review.
Simpler, faster, stronger: Breaking the log-K curse on contrastive learners with FlatNCE.
A simple framework for contrastive learning of visual representations.
Asymptotic evaluation of certain Markov process expectations for large time.
Fairness through awareness.
Model-agnostic meta-learning for fast adaptation of deep networks.
Variational Bayesian optimal experimental design.
A unified stochastic gradient approach to designing Bayesian-optimal experiments.
Deep adaptive design: Amortizing sequential Bayesian experimental design.
Efficient estimation of mutual information for strongly dependent variables.
Demystifying fixed k-nearest neighbor information estimators.
The kernel mutual information.
Kernel methods for measuring independence.
Bootstrap your own latent: A new approach to self-supervised learning.
Scalable feature learning for networks.
Controllable guarantees for fair outcomes via contrastive information estimation.
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.
Provable guarantees for self-supervised deep learning with spectral contrastive loss.
Equality of opportunity in supervised learning.
Momentum contrast for unsupervised visual representation learning.
Predictive entropy search for efficient global optimization of black-box functions.
Fundamentals of convex analysis.
Learning deep representations by mutual information estimation and maximization.
Implicit deep adaptive design: Policy-based experimental design without likelihoods.
Auto-encoding variational Bayes.
Kleinegesse, S. and Gutmann, M. U. Bayesian experimental design for implicit models by mutual information neural estimation. In ICML, PMLR, 2020.
Kleinegesse, S. and Gutmann, M. U. Gradient-based Bayesian experimental design for implicit models using mutual information lower bounds. arXiv preprint arXiv:2105.04379, 2021.
Kleinegesse, S., Drovandi, C., and Gutmann, M. U. Sequential Bayesian experimental design for implicit models via mutual information. Bayesian Analysis.
Estimating mutual information.
Self-organization in a perceptual network.
Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency.
Information theory, inference and learning algorithms.
Multimodality image registration by maximization of mutual information.
Formal limitations on the measurement of mutual information.
Learning word embeddings efficiently with noise-contrastive estimation.
Estimating divergence functionals and the likelihood ratio by convex risk minimization.
Training generative neural samplers using variational divergence minimization.
Representation learning with contrastive predictive coding.
Predictive information in a sensory population.
Pérez-Cruz, F. Estimation of information theoretic measures for continuous random variables.
Mutual-information-based registration of medical images: a survey.
On variational bounds of mutual information.
Learning transferable visual models from natural language supervision.
On nesting Monte Carlo estimators.
A modern course in statistical physics.
Detecting novel associations in large data sets.
A mathematical theory of communication. The Bell System Technical Journal.
Understanding the limitations of variational mutual information estimators.
Approximating mutual information by maximum likelihood density ratio estimation. In New Challenges for Feature Selection in Data Mining and Knowledge Discovery.
On Fenchel mini-max learning.
Contrastive multiview coding.
Deep learning and the information bottleneck principle.
Feature extraction by non-parametric mutual information maximization.
On mutual information maximization for representation learning.
Information-theoretic measures of influence based on content dynamics.
Mutual information gradient estimation for representation learning.
Experiments: planning, analysis, and optimization.
Unsupervised feature learning via non-parametric instance discrimination.
Information-theoretic analysis of generalization capability of learning algorithms.
Mitigating unwanted biases with adversarial learning.

The sin-wave adaptation experiment involves regressing from the input x ∼ Uniform([−5, 5]) to the output of a sine wave κ sin(x − γ), where the amplitude κ ∼ Uniform([0.1, 5]) and the phase γ ∼ Uniform([0, π]) of the sinusoid vary across tasks. We use mean-squared error (MSE) as our loss and set the support size to 3 and the query size to 2. We use simple three-layer MLPs for all the models (regressor, prompt encoder, and FLO critics), with hidden units all set to [512, 512]. During training, we use an episode size of 64. For MAML, we use the first-order implementation (FOMAML) and set the inner learning rate to α = 10^{-4}. For Meta-FLO, we set the regularization strength to λ = 10^{-2}.
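A task generator matching this description might look as follows (our sketch; ranges follow the text, names are ours):

```python
# Sample one sine-wave regression task with support/query splits.
import numpy as np

def sample_sine_task(rng, support=3, query=2):
    kappa = rng.uniform(0.1, 5.0)            # amplitude
    gamma = rng.uniform(0.0, np.pi)          # phase
    x = rng.uniform(-5.0, 5.0, size=support + query)
    y = kappa * np.sin(x - gamma)
    return (x[:support], y[:support]), (x[support:], y[support:])
```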