title: Parton distributions need representative sampling
authors: Courtoy, Aurore; Houston, Joey; Nadolsky, Pavel; Xie, Keping; Yan, Mengshi; Yuan, C.-P.
date: 2022-05-20

In global QCD fits of parton distribution functions (PDFs), a large part of the estimated uncertainty on the PDFs originates from the choices of parametric functional forms and fitting methodology. We argue that these types of uncertainties can be underestimated with common PDF ensembles in high-stake measurements at the Large Hadron Collider and Tevatron. A fruitful approach to quantify these uncertainties is to view them as arising from sampling of allowed PDF solutions in a multidimensional parametric space. This approach applies powerful insights gained in recent statistical studies of large-scale population surveys and quasi-Monte Carlo integration methods. In particular, PDF fits may be affected by the big-data paradox, which stipulates that more experimental data do not automatically raise the accuracy of PDFs -- close attention to the data quality and sampling of possible PDF solutions is as essential. To test if the sampling of the PDF uncertainty of an experimental observable is truly representative of all acceptable solutions, we introduce a technique ("a hopscotch scan") based on a combination of parameter scans and stochastic sampling. With this technique, we show that the PDF uncertainty on key LHC cross sections at 13 TeV obtained with the public NNPDF4.0 fitting code is larger than the nominal uncertainty obtained with the published NNPDF4.0 Monte-Carlo replica sets. In the PDF ensembles obtained in the analytic minimization (Hessian) formalism, the tolerance on the PDF uncertainty must be based on sufficiently complete sampling of PDF functional forms and choices of the experiments.
Precision phenomenology at hadron colliders relies upon accurate predictions in the Standard Model (SM). An overwhelming number of such theoretical predictions require parton distribution functions (PDFs) in a proton, the nonperturbative functions f_a(x, Q) quantifying probabilities for finding quarks and gluons in a proton at an energy scale Q above 1 GeV. Multiple groups [1-9] provide increasingly sophisticated parametrizations of PDFs by fitting a growing collection of precise experimental data sets to advanced multiloop calculations. High-luminosity (HL) measurements at the Large Hadron Collider (LHC) and planned DIS experiments (EIC, LHeC, Muon-Ion Collider, ...), combined with the progress in perturbative QCD calculations, open opportunities both to learn about the PDFs and to find their new applications. The global QCD analysis to determine the PDFs can now be attempted by a broad circle of users thanks to the publicly available xFitter [10] and NNPDF4.0 [8] fitting codes. A recent whitepaper [11] contributed to the Snowmass'2021 Summer Study reviews ongoing progress in the PDF analysis.

In this article, we summarize a study of a rarely discussed source of some observed differences between the published parton distributions. This source is generic and can be especially prominent in analyses of large data samples dependent on many parameters, notably in large-scale population surveys [14, 15]. We adapt relevant mathematical tools for understanding such surveys to the PDF analysis. The key observation from these studies is that the impact of small biases in the sampling of possible solutions, such as in the selection of best-fit models of PDFs obtained using various choices of experiments or functional forms of PDFs, may grow as the volume of fitted data and the complexity of the analysis increase. In an (unfortunately not rare) situation when the sampling is unrepresentative of the population of allowed solutions, one may end up with a wrong conclusion described by Xiao-Li Meng [14] as the big-data paradox, namely, "the bigger the data, the surer we fool ourselves." Sampling biases must be controlled in high-stake phenomenological measurements, such as the recent measurement of the W boson mass by the CDF collaboration [16], together with other types of uncertainties.

The possible existence of an unstated source of the PDF uncertainty is suggested by an observation that, while several recent global analyses constrain the PDFs with comparably strong sets of fitted experimental data, in some phenomenologically important cases these analyses arrive at noticeably different estimates of uncertainties on the PDFs and PDF-dependent predictions. For example, the estimates for the correlated uncertainties on key LHC and Tevatron total cross sections presented in [11, 12] vary between the recent PDF fits. In Fig. 1, the 95% C.L. uncertainty regions on the Z, Higgs, and W± total production cross sections at the LHC 14 TeV and CDF II vary in size over a wide range. It has been demonstrated that the uncertainties may reflect the fitting methodology as much as the strength of the experimental constraints. When the CT18 [5] and NNPDF3.1.1 [4] NNLO PDFs were compared in Sec. 2 of the 2021 benchmarking study by the PDF4LHC group [12], the former systematically predicted a larger uncertainty in the moderate-x region than the latter. The magnitudes of the MSHT20 NNLO uncertainties [7] in these comparisons tended to lie between the CT18 and NNPDF3.1 ones.
More intriguingly, in the course of the PDF4LHC21 exercise, the three global PDF groups conducted fits to a set of common data, using common settings, so as to establish comparisons/benchmarks. The common data set (termed the "reduced set") was diverse enough to provide constraints on all PDF flavors, but limited enough so that all groups were expected to find similar estimates of PDF uncertainties. In the fits to the same reduced data set [12, Sec. 3, especially Figs. 3.4 and 3.5], the NNPDF3.1 (reduced) uncertainties came out to be systematically smaller than the CT18 (reduced) and MSHT20 (reduced) uncertainties.

[Figure 1: 95% C.L. uncertainty regions for benchmark total cross sections obtained with the PDF4LHC21 [12], PDF4LHC15 [13], NNPDF4.0 [8], CT18 [5], MSHT20 [7], ABMP16 [3], and ATLASpdf21 [9] NNLO PDFs with αs(MZ) = 0.118.]

Such differences are often attributed to the tolerance conventions for the PDF uncertainty adopted by the global analysis groups. [The tolerance typically accounts for a combination of experimental, theoretical, parametrization, and methodological uncertainties on the PDFs; see the discussion in [17].] Are the tolerance conventions mostly subjective, or can some conventions perform better than others? The question is sharpened by formulating it as a problem about sampling of a specific QCD observable that depends on PDF parameters populating a multidimensional space.

Section II reviews mathematical essentials that elucidate this question. The trio identity for the sample deviation (Sec. II C) and cornerstone properties of multidimensional (quasi-)Monte Carlo integration (Sec. II D) demonstrate that complex, large-scale analyses are at an elevated risk of an unaccounted sampling bias. Global QCD analyses must strive for representative sampling of all acceptable solutions, which may increase the resulting PDF uncertainties or effective tolerance. This section also points out fundamental difficulties in performing an all-inclusive test for representative sampling in a multi-parametric global fit. Such a sweeping test is likely impractical. On the other hand, a practical question, "What is the sampling uncertainty on a given observable X?", can be highly tractable using the already available technology for PDF fits. We point out the general rationale in Sec. II D and present an example application in Sec. III, where we investigate the PDF uncertainty on the LHC benchmark cross sections using the NNPDF4.0 error sets and the publicly available mcgen and NNPDF fitting codes. The hopscotch scan+sampling technique introduced there suggests that the PDF uncertainty on key LHC cross sections at 13 TeV is larger than the nominal uncertainty obtained with the published NNPDF4.0 error sets. Section IV contains conclusions.

A. Setup of the problem; the R mechanism

We examine the probability for a QCD observable G dependent on the PDFs, such as the QCD coupling strength α_s or a collider cross section σ. Predictions for observables are the ultimate targets for the propagation of the PDF uncertainties. The goal of the physics endeavor is to estimate the truth value G_truth of G that is objectively realized in Nature. In practice, at most we can hope to determine the expectation value E_p(G) = (1/N_p) Σ_{i=1}^{N_p} G_i on the population of many measurements or other determinations G_i of G, where N_p is a very large number of determinations. We assume that the determinations G_i are properly designed, so that E_p(G) agrees with G_truth (i.e., E_p(G) − G_truth is arbitrarily small) for N_p that is sufficiently large.
For example, for G = α_s, the population expectation E_p(α_s) could be computed on a future sample of many measurements obtained after several more decades of well-funded research. If G is a cross section σ computed with a multi-parameter PDF ensemble, E_p(σ) can be the expectation value with the PDF ensemble that densely and representatively samples the whole parameter space. The conundrum for many studies is that achieving such a large N_p may not be feasible. Often one selects a sample of N_s replicas from the population, with N_s < N_p or even N_s ≪ N_p, and estimates the sample expectation value as

E_s(G) = (1/N_s) Σ_{i=1}^{N_p} R_i G_i = E_p(RG) / E_p(R).   (1)

In the last step, we expressed the sample expectation E_s(G) as a ratio of population expectations E_p(RG) and E_p(R), where R_i is an array of N_p "sampling indicators" such that for each element G_i of the population

R_i = 1 if G_i is included in the sample, and R_i = 0 otherwise.   (2)

The sample expectation deviation Δ_E ≡ E_s(G) − E_p(G) ≈ E_s(G) − G_truth is controlled by the accuracy of each determination, or a replica in the case of PDFs, G_i, as well as by the accuracy of sampling of N_s replicas from the population. The fitting accuracy/sampling accuracy distinction and the representation using the R indicators ("the R mechanism") are borrowed from the study [14] of large-scale surveys, in which "fits" or "replicas" are equivalent to "responses to the survey." Namely, the accuracy of a single replica G_i can be raised by reducing experimental, theoretical, and computational errors. From here on, we will assume that the individual G_i are sufficiently accurate. In contrast, the sampling accuracy reflects how adequately we sample the population of N_p acceptable replicas. If such sampling is biased, the magnitude of the sample deviation can be estimated using the R mechanism, see Eq. (3). Small biases due to insufficiently representative sampling of large populations may produce large deviations. Surveys of the COVID-19 vaccination rate with very large samples of responses and small statistical uncertainties (e.g., Delphi-Facebook) greatly overestimated the actual vaccination rate published by the Centers for Disease Control and Prevention (CDC) after some time delay [15]. The deviation has been traced to the sampling process. In contrast to the random error, which decreases as 1/√N_s, the sample expectation deviation can grow with both N_p and N_s.

Concurrently with the formalism for the large population surveys, a related statistical formalism has been developed to understand the convergence of quasi-Monte Carlo (QMC) methods for multidimensional integration [18]. Both connections, to the surveys and to QMC integration, help us to elucidate our problem in the context of the PDF analysis, in which it can be posed as follows:

Problem. Estimate an expectation value E_p(G) of an observable G on a [possibly unknown] population of N_p replicas, given a sample of N_s values G_i, where N_s < N_p. To get such an estimate, it suffices to adopt an R mechanism that renders Δ_E = E_s(G) − E_p(G) = 0 within a prescribed error.

In this section, we discuss the convergence of such sampling estimates. As a toy example, consider a population of NNLO Higgs boson cross sections G ≡ σ(gg → H) at the LHC c.m. energy 14 TeV. The cross sections are computed with N_p = 900 error sets of the baseline PDF4LHC21 PDF ensemble [12] consisting of 300 MSHT20, 300 NNPDF3.1.1, and 300 CT18 replicas. [The replicas are ordered as in the actual 900-replica baseline ensemble. The mean cross section of the CT18 (NNPDF3.1.1) subset is slightly lower (higher) than the population mean.]
We have E_p(G) = 47.492 pb and wish to obtain a close estimate by sampling only N_s = 300 replicas out of 900. If we randomly select N_s = 300 replicas from the whole population, we obtain Δ_E = 0 ± 0.033 pb, where the 68% confidence level (CL) uncertainty is computed by repeating the random selection 1000 times. In this case, E_s and E_p are statistically indistinguishable. It is known on general grounds that, with random sampling from the whole distribution, Δ_E decreases as 1/√N_s, independently of N_p [14]. As an instance of a different sampling, let us select 100 replicas from each of the MSHT, NNPDF, and CT18 subsamples, for a total of N_s = 300 replicas. In this case, we still get Δ_E = 0 ± 0.03 pb, i.e., no deviation. Since the PDF4LHC21 baseline set of 900 replicas is constructed by randomly selecting 300 replicas from each of the MSHT20, NNPDF3.1.1, and CT18 1000-replica samples, we conclude that this PDF4LHC21 non-random combination prescription introduces no appreciable deviation. If we select 100 MSHT20 replicas, 200 NNPDF3.1.1 replicas, and no CT18 replicas, we get Δ_E = 0.206 ± 0.023 pb, a significant deviation. In this instance, the bias was introduced by hand, but in realistic situations the bias can arise from apparently small departures from random (probabilistic) sampling.

The trio identity [14, 18] for the sample expectation deviation Δ_E is a representation introduced to examine the convergence of the sampling algorithm. For our problem, the trio identity takes the form

Δ_E = Corr_p[R, G] · √(N_p/N_s − 1) · σ_p(G).   (3)

The three factors on the right-hand side are population expectations with different dependencies on N_s and N_p. In Eq. (3), the population standard deviation σ_p(G) reflects the complexity of the population distribution. The measure discrepancy √(N_p/N_s − 1) is due to the mismatch between the sizes of the population (N_p) and the sample (N_s). The confounding correlation Corr_p[R, G] lies between −1 and 1. It quantifies the efficiency of the sampling algorithm in comparison to simple random sampling. The confounding correlation reflects the methodology of the analysis. Methodological correlations play a central role in precise PDF analyses [19], together with data-driven [20] and theory-driven [21-24] correlations.

If the sampling exercise is repeated N_R times while keeping the same N_s, each time choosing a different R array, one can estimate the mean-square error (MSE) of the sample deviation for a given R mechanism as the average of Δ_E² over the repetitions (Eq. (4)). The trio identity establishes the dependence of the sample deviation Δ_E on the sampling algorithm [14]:

1. Under simple random sampling (SRS), when replicas are independently selected with identical probability, the sample deviation converges to the truth as 1/√N_s, in compliance with the law of large numbers (Eq. (5)). Comparison of Eqs. (4) and (5) shows that, under SRS, the expected squared confounding correlation is E_SRS[Corr_p²[R, G]] = 1/(N_p − 1).

2. For an arbitrary sampling algorithm, the sample deviation satisfies Eq. (6); relative to SRS, its MSE is enhanced by the factor (N_p − 1) E_R[Corr_p²[R, G]]. For the sampling deviation to vanish as N_s increases, Corr_p[R, G] should decrease at least as fast as o(1/√(N_p − 1)). Absent this behavior, unrepresentative sampling may lead to a situation when the sample deviation remains large in spite of misleadingly small standard error estimates. Meng dubbed this situation "the big-data paradox," which is clearly undesirable and unfortunately can go unnoticed if the sampling accuracy is not controlled to a sufficient degree.
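The toy exercise and the trio identity can be reproduced numerically in a few lines. The sketch below is illustrative only: the three subsets are drawn from Gaussian distributions with assumed means and widths rather than from the actual PDF4LHC21 cross sections, so the numerical values of Δ_E differ from those quoted above; the identity check itself is exact.

# Toy sampling experiment in the spirit of Sec. II: a population of N_p = 900 values
# split into three 300-replica subsets with slightly different means (stand-ins for
# the MSHT20, NNPDF3.1.1, and CT18 subsets; all numbers below are illustrative).
import numpy as np

rng = np.random.default_rng(1)
msht  = rng.normal(47.49, 0.40, 300)
nnpdf = rng.normal(47.60, 0.30, 300)   # mean slightly above the population mean
ct18  = rng.normal(47.40, 0.45, 300)   # mean slightly below the population mean
population = np.concatenate([msht, nnpdf, ct18])
Np, Ns = population.size, 300
Ep, sigma_p = population.mean(), population.std()

def sample_deviation(idx):
    """Delta_E = E_s(G) - E_p(G) for the sample selected by the index array idx."""
    R = np.zeros(Np); R[idx] = 1.0            # the R "sampling indicators" of Eq. (2)
    Es = (R * population).sum() / R.sum()     # E_s(G) = E_p(RG)/E_p(R), Eq. (1)
    corr = np.corrcoef(R, population)[0, 1]   # confounding correlation Corr_p[R, G]
    trio = corr * np.sqrt(Np / Ns - 1.0) * sigma_p   # right-hand side of Eq. (3)
    return Es - Ep, trio

# 1) simple random sampling from the whole population
d1, t1 = sample_deviation(rng.choice(Np, Ns, replace=False))
# 2) biased sampling: 100 from the first subset, 200 from the second, none from the third
biased = np.concatenate([rng.choice(300, 100, replace=False),
                         300 + rng.choice(300, 200, replace=False)])
d2, t2 = sample_deviation(biased)
print(f"SRS    : Delta_E = {d1:+.3f}, trio identity = {t1:+.3f}")
print(f"biased : Delta_E = {d2:+.3f}, trio identity = {t2:+.3f}")

In both cases the directly computed deviation equals the product of the three trio factors, while only the biased selection produces a deviation much larger than the SRS scatter.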
The trio identity elucidates why quasi-Monte Carlo (QMC) methods for multidimensional integration may converge at a faster or slower rate compared to Monte Carlo integration based on SRS [18]. When integration is performed over a unit hypercube in N_par dimensions, the sample deviation Δ_E coincides with the (hyper)cubature error and can be decomposed into three factors that play the same roles as in Eq. (3). Of particular interest to us is the convergence of QMC integration when N_par is large. In this limit, the minimal number of MC replicas that guarantees a convergent integral for an arbitrary integrand grows as 2^N_par [25]. Not only does dense sampling of a high-dimensional volume require an exponentially growing number N_p of replicas, such as 2^N_par ∼ 10^30 for N_par = 100; suppressing the confounding correlation to the adequate degree is likely just as daunting a feat. The sample expectation of a QCD observable G(a) in PDF fits is merely an integral of the weighted likelihood function P(D|T(a)) over the N_par PDF parameters a:

E(G) = ∫ G(a) P(D|T(a)) d^N_par a.   (7)

We immediately conclude that convergence of E_s(G) to the truth for an arbitrary G(a) is not at all guaranteed in a PDF fit that depends on too many parameters and does not control for representative sampling. In such a complex fit, one practically cannot know if the sample PDF uncertainty covers the truth values for all G(a).

On the positive side, it follows from Eq. (7) that, if G(a) is known to depend substantially only on a few components of a, estimation of E_s(G) becomes highly tractable. The reason is that the convergence rate of QMC integration is controlled by the effective number of components, i.e., directions in the parameter space along which the variance of the integrand is significant [26]. If the number of such components is small, the integration can be arranged so as to give more weight to the sampling of the manifold spanning the corresponding "large dimensions." For example, the coordinates in the subspace with the highest variances of G(a) can be sampled most densely. The coordinates in the complementary subspace with low variances can be either fixed or sampled with a low density. Techniques exist for ranking the N_par coordinates according to the variance of the integrand using the analysis of variance (ANOVA) [27], principal component analysis (PCA), or another dimensionality reduction method. The accuracy of integration can be iteratively improved by adding contributions from the coordinates with lower variances [28]. See the discussion in Sec. 8 of [18].

Experience with high-dimensional integration thus raises a warning for analyses that fit a large number of flexible functions using a modest number of fitted replicas. While these analyses excel in finding acceptable sets of functions describing the data, they are nevertheless prone to the risk of a sampling bias that grows with the dimensionality of the problem. Apparent reduction of the variance does not eliminate this risk because of the big-data paradox quantified by the trio identity. All-inclusive testing for representative sampling is difficult with a lot of free parameters. Fortunately, typical QCD cross sections depend on specific combinations of PDFs that can be established using data set diagonalization [29] or a related method. Sampling of a known PDF combination can be tested at a greatly reduced cost based on the dimensionality-of-integration argument presented above. The hopscotch scans described in the next section realize such a test in practice.
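To illustrate how the "large dimensions" can be identified in practice, the following sketch ranks coordinate directions by the variance they induce in an observable, using a one-at-a-time scan of a synthetic toy model of G(a); the model, the chosen directions, and the 95% threshold are assumptions made purely for illustration.

# Rank parameter directions by how much variance of G(a) they carry, using a simple
# one-at-a-time (ANOVA-like) scan on a synthetic observable. In a real application,
# G(a) would be a cross section evaluated along Hessian eigenvector directions.
import numpy as np

Npar = 50
g = np.zeros(Npar); g[[1, 4, 6, 14, 16]] = [1.0, 0.7, 0.5, 0.3, 0.2]   # toy response
G = lambda a: 100.0 + a @ g + 0.01 * np.sum(a**2)

# vary each coordinate over [-1, 1] with the others fixed at 0
grid = np.linspace(-1.0, 1.0, 21)
var_per_dir = np.array([np.var([G(np.eye(Npar)[i] * w) for w in grid])
                        for i in range(Npar)])
order = np.argsort(var_per_dir)[::-1]
cumulative = np.cumsum(var_per_dir[order]) / var_per_dir.sum()
n_large = int(np.searchsorted(cumulative, 0.95)) + 1
print("dominant directions:", order[:n_large],
      f"({n_large} directions carry ~95% of the scanned variance)")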
A. CT18 and NNPDF4.0 probability regions for the LHC benchmark cross sections

Sampling occurs at several stages of global fits, including the selection of fitted experiments, Monte-Carlo (MC) integration or sampling, as well as estimates of uncertainties due to correlated systematic errors, functional forms of PDFs, and fit assumptions. Validation of representative sampling is therefore as essential as tests of the quality of individual fits, such as strong goodness-of-fit tests on the resulting PDFs [17] and the closure test [30] of the agreement of a trial fit with a predetermined truth value within the uncertainty. However, for an all-out sampling test, the computation of the confounding correlation in the trio identity (3) requires the population distribution as an input, which is not known while the fits are performed. The confounding correlation can be predicted to a degree by using a model population distribution based on simulated pseudodata, in the same spirit as done in the closure test. Instead of testing the full sampling procedure, simpler tests can investigate specific QCD observables. The formalism from survey studies [14, 15] can be applied to such observables by viewing each prediction analogously to an individual response to a survey.

In Fig. 2, we examine the PDF uncertainties on LHC cross sections at √s = 13 TeV computed at NNLO in the QCD coupling strength according to the settings listed in the Appendix. For experimental collaborations, it is important to know which theoretical predictions are acceptable given the latest experimental and theoretical constraints. In this exercise, we consider predictions based on the global fits that pass the goodness-of-fit criteria adopted in the CT18 global QCD analysis [5]. The black solid ellipses denote the 68% CL regions obtained with the CT18 NNLO eigenvector (EV) PDFs in the analytic minimization (Hessian) framework [31]. The CT18 uncertainties are constructed so as to cover the solutions obtained in the CT18 fit using a large number of alternative parametrization forms and fit settings that were explored during preliminary CT18 fits. Thus, while the final CT18 PDF ensemble is provided with a single choice of the PDF functional forms, the CT18 PDF uncertainties reflect sampling over many (250-300) alternative parametrization forms, as well as variations in the QCD scales of some experiments. In addition, we provide an alternative CT18Z PDF fit, in which the strangeness and gluon PDFs are modified as a result of (a) including the precision W/Z production data set by ATLAS 7 TeV (4.6 fb⁻¹) [32], (b) using a factorization scale in DIS that mimics small-x saturation, and (c) other changes in the selection of experiments. As with the CT18 ensemble, the nominal CT18Z uncertainty reflects solutions with the alternative parametrization forms and settings. The CT18 and CT18Z error bands are compatible at approximately the 90% probability level. Confidence regions based on the CT18Z PDFs are shown as black dashed ellipses. The shifts in the CT18Z predictions, as compared to the CT18 ones, reflect to a large degree the inclusion of the ATLAS W/Z data set [32], which shows substantial tension with the DIS experiments. Taken in combination, the CT18 and CT18Z confidence regions robustly predict the range of outcomes based on the various sampling options. The uncertainties obtained with this prescription tend to be somewhat larger than the ones estimated using the dynamic tolerance adopted by the MSHT group. On the other hand, the dynamic tolerance estimates can be fragile if there are large tensions among the experiments [5, App. A.4.b]. The CT approach is more robust to such tensions.
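Hessian confidence regions in such cross-section planes can be drawn from the eigenvector-set predictions alone. The sketch below is a minimal illustration using the standard symmetric Hessian formulas (cf. [31]); it assumes arrays sigX and sigY holding predictions for two observables on the central member and on plus/minus pairs of EV members in the usual CT-style ordering, and it ignores the rescaling between the native 90% CL of the CT sets and the 68% CL regions shown in the figures.

# Hessian correlation ellipse for two observables X and Y from eigenvector (EV) sets.
# sigX[0], sigY[0] are central predictions; sigX[2i-1], sigX[2i] are the +/- members
# of EV direction i (adjust the ordering if a given PDF set differs).
import numpy as np

def hessian_ellipse(sigX, sigY):
    Xp, Xm = sigX[1::2], sigX[2::2]
    Yp, Ym = sigY[1::2], sigY[2::2]
    dX = 0.5 * np.sqrt(np.sum((Xp - Xm) ** 2))               # symmetric PDF error on X
    dY = 0.5 * np.sqrt(np.sum((Yp - Ym) ** 2))
    cos_phi = np.sum((Xp - Xm) * (Yp - Ym)) / (4.0 * dX * dY)  # PDF-induced correlation
    return dX, dY, cos_phi

def ellipse_points(x0, y0, dX, dY, cos_phi, n=200):
    """Parametric boundary of the Hessian uncertainty ellipse around (x0, y0)."""
    t = np.linspace(0.0, 2.0 * np.pi, n)
    sin_phi = np.sqrt(max(0.0, 1.0 - cos_phi ** 2))
    x = x0 + dX * np.cos(t)
    y = y0 + dY * (cos_phi * np.cos(t) + sin_phi * np.sin(t))
    return x, y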
It is interesting to compare the CT18(Z) uncertainties with those from the NNPDF4.0 analysis [8] using either their Monte-Carlo (MC) or Hessian error sets. Recall that the CT and MSHT groups perform analytic minimization of χ² and provide Hessian eigenvector (EV) sets to estimate PDF uncertainties in applications. In the NNPDF4.0 approach, the error PDFs are constructed by training hyperoptimized neural networks on replicas of randomly fluctuated experimental data. The NNPDF error PDFs are provided in two formats, as the MC replica PDFs and as a Hessian set that is obtained by a post-fit conversion of the replicas. By construction, the Hessian set preserves the PDF uncertainties of the MC set. We use these Hessian eigenvectors to establish a Cartesian basis in the space of NN PDF solutions.

Overlaid on the nominal uncertainties, the light blue ellipses indicate approximate regions containing PDF solutions that have about the same χ², according to the NNPDF fitting code, as the NNPDF4.0 central replica 0. Specifically, they have Δχ² ≡ χ² − χ²_0 = 0 ± 3, where χ²_0 is computed for replica 0. The brown ellipses are the analogous regions that contain the PDF solutions with Δχ² = −60 ± 3. The 6-unit width of these intervals was chosen to contain enough replicas within each interval to reconstruct the ellipses according to the procedure explained in Sec. III B. These PDF solutions are linear superpositions of the NNPDF4.0 Hessian replicas available in the LHAPDF library [33]. The χ² values are computed using the published NNPDF4.0 code [34, 35], in accord with the convention adopted in the comparison tables in Sec. 5.1 of the NNPDF4.0 publication.

We construct the alternative NNPDF4.0 solutions using an algorithm that we call a hopscotch scan, which performs focused sampling of the PDF combinations giving the dominant contribution to the PDF uncertainty of the shown cross sections. The technique combines Lagrange multiplier (LM) scans of PDF parameters [36] along the Hessian eigenvector (EV) directions with stochastic sampling of the few "large" dimensions associated with the largest variance of the shown LHC cross sections, in accord with the general discussion in Sec. II D. Section III B describes the hopscotch scan and presents plots of replica distributions with other Δχ².

It is obvious from Fig. 2 that the NNPDF-like solutions span substantially larger regions than the nominal 68% NNPDF4.0 uncertainties. While their χ² is computed for the same selection of experiments as in the NNPDF4.0 fit, the individual resulting PDFs have been examined and would be acceptable, according to the CT18 procedure, as giving the same or better χ² as the nominal fit. The position and size of the alternative regions associated with a given Δχ² depend on the definition of the χ², which in turn reflects the adopted model for the experimental correlated systematic errors. Figure 7 illustrates the alternative replica clouds obtained with the experimental χ² adopted in the tables of the NNPDF4.0 publication and with the t0 χ² that is minimized when training the NNPDF4.0 replicas. The alternative regions clearly depend on the definition of χ², which in turn reflects ambiguities in the implementation of the systematic uncertainties provided by the experiments. These are just two out of several χ² definitions that are in use by the PDF fitting groups [37]. Dependence on the χ² definition must be systematically explored as a part of a more complete exercise, which is beyond the scope of this study. Even within this limited scope, it is clear that there can be many acceptable solutions with Δχ² ≤ 0 outside the nominal NNPDF4.0 uncertainty. Our analysis establishes lower estimates on the corresponding regions in the cross-section planes, noting that the regions are comparable in size to the CT18(Z) ones obtained with the two-tier tolerance and can be shifted further from CT18 than the nominal NNPDF4.0 predictions.
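For orientation, a linear superposition f_0 + Σ_i w_i Δf_i of the NNPDF4.0 Hessian members can be evaluated numerically in a few lines of Python. The sketch below is an illustration only: the LHAPDF set name and the weights are assumptions, and in the actual study the corresponding LHAPDF grids were generated with mcgen and passed to the NNPDF code to compute χ².

# Evaluate a "shifted" PDF replica f = f0 + sum_i w_i * (f_i - f0) built from the
# NNPDF4.0 Hessian members, without writing a new LHAPDF grid.
import lhapdf   # requires the LHAPDF Python bindings and the downloaded grid

members = lhapdf.mkPDFs("NNPDF40_nnlo_as_01180_hessian")  # member 0 + EV members
weights = {6: 1.5, 2: -0.8}   # hypothetical displacements w_i along EV directions 6 and 2

def xf_shifted(pid, x, Q):
    """x*f(x,Q) for the linear superposition f0 + sum_i w_i * Delta f_i."""
    central = members[0].xfxQ(pid, x, Q)
    shift = sum(w * (members[i].xfxQ(pid, x, Q) - central) for i, w in weights.items())
    return central + shift

print(xf_shifted(21, 0.01, 100.0))   # gluon at x = 0.01, Q = 100 GeV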
In this section we describe the construction of the alternative replicas that led to the ellipses of Fig. 2. The procedure realizes the general considerations in Sec. II D, namely:

1. The NNPDF4.0 Hessian ensemble establishes natural basis coordinates a in the space of MC replicas.
2. For a typical QCD observable G(a), the largest variances are associated with 4-8 "large dimensions" in the a space.
3. The PDF uncertainty on G(a) can be estimated with a moderate number of MC PDF replicas that vary along the large directions.

We generate LHAPDF6 tables for the sample PDF replicas using the mcgen code [38-40] and the LHAPDF tables of the NNPDF4.0 NNLO Hessian ensemble as the input. The total χ² of the NNPDF4.0 analysis was evaluated using the public code released by NNPDF [34, 35], without refitting. Specifically, the χ² is computed by the perreplica_chi2_table function of the validphys program included in the NNPDF code. The kinematic cuts were fixed to be the same as in the NNPDF4.0 global analysis [8]. The minimum values of Q² and W² for DIS measurements were hence chosen to be 3.49 GeV² and 12.5 GeV², respectively. The first part of this section adopts the experimental χ² definition. The second part includes comparisons to the t0 definition based on the 210713-n3fit-001 reference PDF set provided along with the NNPDF code and adopted in the NNPDF4.0 NNLO global analysis [34].

The Hessian representation of the NNPDF4.0 ensemble provides the central replica (f_0) and N_eig = 50 error PDF sets f_i corresponding to displacements by a +1σ value (f_0 + Δf_i) along each independent eigenvector (EV) direction. The total Δχ²_i of each EV set, computed with respect to replica 0, varies among the individual EV sets, with some Δχ²_i being as large as +35 (for EV 1) or as low as −20 (for EV 2), and the majority no more than 5-10 units in magnitude. As only one error set is provided per EV direction, this creates an expectation of an approximately symmetric quadratic behavior of Δχ² centered on f_0. This expectation is illustrated in Fig. 3 for EV set 6 as a red parabola, in which the red points correspond to replica 0 and EV set 6. The horizontal axis is labeled in units of the 1σ displacement for EV set 6, and the vertical axis shows Δχ². As an alternative to the red parabola, the actual Δχ² behavior might have been very irregular, which may happen if NN fits show large deviations from Gaussianity. To test which of the two hypotheses is correct, we explicitly computed the Δχ² at the green points, for which the LHAPDF tables are constructed as f_0 + w_i Δf_i, where the real parameter w_i quantifies the displacement on the respective horizontal axis. Figure 4 shows these Δχ² scans for all N_eig EV directions. From these χ² scans, we find that the Δχ² behavior is consistent with a quadratic one along all EV directions. Blue curves interpolating the green points are consistent with symmetric parabolas whose minima, f_{i,min} ≠ f_0, are displaced from replica 0 in many EV directions and render negative Δχ²_{i,min} that can be as low as ≈ −40 (for EV 2). The widths of the reconstructed parabolas vary noticeably.
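The parabola reconstruction along each EV direction amounts to a one-dimensional quadratic fit. A minimal sketch, with placeholder scan values standing in for the actual Δχ² returned by the NNPDF code:

# Fit a parabola to a Delta chi^2 scan along one EV direction and locate its minimum.
# 'w' are displacements in units of the +1 sigma EV shift; 'dchi2' are placeholder values.
import numpy as np

w     = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
dchi2 = np.array([ 18.0,  2.0, 0.0, 6.0, 20.0, 42.0])   # illustrative scan values

c2, c1, c0 = np.polyfit(w, dchi2, 2)          # Delta chi^2 ~ c2*w^2 + c1*w + c0
w_min      = -c1 / (2.0 * c2)                 # displaced minimum along this direction
dchi2_min  = c0 - c1**2 / (4.0 * c2)          # depth of the minimum relative to replica 0
print(f"minimum at w = {w_min:+.2f} with Delta chi^2 = {dchi2_min:+.1f}")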
These observations strongly suggest a regular, quasi-quadratic behavior of χ² in the vicinity of the central NNPDF4.0 replica and the existence of a displaced global minimum in parameter space for which the χ² is smaller than the value provided by the central replica. The hopscotch scan technique explores this low-χ² region by focusing on specific QCD cross sections. [Finding the displaced global minimum in the whole 50-dimensional space is more computationally expensive, as the complexity of combinatorial and geometrical factors increases quickly.] We draw a low-dimensional "court" based on the χ² behavior gleaned from the EV direction scans and then repeatedly "throw a marker" according to one of the strategies to generate PDF replicas at points inside the court.

For example, to find a region with replicas satisfying Δχ² ≤ T² in the plane of two cross sections, such as σ_tt̄ and σ_Z, we use the interpolated parabolas in Fig. 4 to find up to two "pole" PDF sets corresponding to Δχ² = T² for each of the 50 EV directions. We plot the {σ_tt̄, σ_Z} pair for each pole set, as is done for T² = +10, 0, −10, and −20 in the upper left panel of Fig. 5. In the N_eig-dimensional space, the pole sets correspond to the corners of a rectangular block whose projection on the {σ_tt̄, σ_Z} plane is a polygon with the corners corresponding to the EV directions with the largest displacements of the cross sections from the central predictions. In the upper left panel of Fig. 5, these are EV directions 5, 2, 7, 15, 17, 20, and 10. The other EV directions (examined, but not shown in the figure) generate smaller displacements. For this cross section pair, we generate 2 × 300 replicas in the court consisting of two rectangular blocks spanning the EV directions {5, 2, 7, 15, 17} and {17, 20, 6, 10, 5}. A replica is generated in an n-dimensional block as f = f_0 + Σ_{i=1}^{n} w_i Δf_i, where each w_i is a random real number that is uniformly distributed along the i-th EV direction between the two corresponding pole sets with Δχ² = T². We then generate the replicas for three more pairs of cross sections: Z vs. W± (summed over the W boson charges; EV directions 2, 7, 23, 20, 17, 5); W+ vs. W− (EV directions 2, 13, 1, 17, 14); and tt̄ vs. H (EV directions 8, 15, 17, 4, 2, 5). Our cumulative set from all scans contains 1940 replicas to examine solutions with Δχ² ≤ 20.

In the right column of Fig. 5, we use varied colors to plot subsamples of replicas that have Δχ² within ±3 units of the Δχ² values specified in the figure. The distribution of these replicas is consistent with the displaced global minimum, near which some replicas have Δχ² as low as −84 units. The lowest Δχ² corresponds to the regions populated by brown markers. In the lower row of Fig. 5, we show the dominant EV directions and replica samples for the σ_Z vs. σ_H pair, which was not included in the generation of replicas. However, since this pair shares the dominant directions with the sampled cross section pairs, we can predict the PDF uncertainties for this pair as well.
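A schematic of the "throw a marker" step described above: per-direction parabola coefficients are placeholders standing in for the fits to the scans of Fig. 4, and the step of writing the corresponding LHAPDF grids with mcgen and re-evaluating χ² is omitted.

# Generate hopscotch replicas inside a rectangular "court": along each selected EV
# direction, draw w_i uniformly between the two "pole" displacements where the fitted
# parabola crosses Delta chi^2 = T^2.
import numpy as np

rng = np.random.default_rng(7)
T2 = 0.0                                     # target Delta chi^2 level of the court
# direction index -> (c2, c1, c0) of Delta chi^2(w) = c2 w^2 + c1 w + c0 (illustrative)
parabolas = {5: (8.0, -6.0, 0.0), 2: (5.0, -25.0, 0.0), 7: (10.0, -4.0, 0.0)}

def poles(c2, c1, c0, T2):
    """Roots of c2 w^2 + c1 w + c0 = T2, i.e., the two pole displacements."""
    disc = c1**2 - 4.0 * c2 * (c0 - T2)
    if disc < 0.0:
        return None                          # parabola never reaches this Delta chi^2
    r = np.sqrt(disc)
    return (-c1 - r) / (2.0 * c2), (-c1 + r) / (2.0 * c2)

def throw_marker():
    """One set of weights {i: w_i} for a replica f0 + sum_i w_i * Delta f_i."""
    weights = {}
    for i, (c2, c1, c0) in parabolas.items():
        pp = poles(c2, c1, c0, T2)
        if pp is not None:
            weights[i] = rng.uniform(*pp)
    return weights

replicas = [throw_marker() for _ in range(300)]
print(replicas[0])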
The hopscotch scan is mainly a search algorithm and, in its current realization, does not include convergence criteria, nor does it guarantee finding the true global minimum. [These aspects can be further developed along the lines discussed in Sec. II D.] The role of the hopscotch is to reduce the dimensionality of the search for solutions with a lower χ² and to identify regions in the cross section space corresponding to such solutions. While our set of solutions is not exhaustive, it can be used to estimate the size of the projected area for a given value of Δχ², say Δχ² = 0. The sample's convex hull gives a crude boundary of this region. On the other hand, since the EV scans in Fig. 4 are strongly indicative of an approximately Gaussian behavior of χ², it seems reasonable to assume that the populated regions in the cross section planes are approximately elliptical. With this information, a highly effective approach to estimate the boundary is to fit an ellipse to the outermost points of the replica subsample in the cross section plot. The quadratic form describing each ellipse can be computed algebraically using a public Mathematica program from [41] for the reconstruction of multi-dimensional ellipsoids from such projections. A two-dimensional ellipse can be reconstructed from as few as 6 points on the convex hull of the sample. In our case, due to the selection of the 6-unit widths of the examined Δχ² fringes, the convex hulls contain no fewer than 15 points per ellipse, so they can be fitted with good certainty. These approximate elliptical regions for Δχ² = 0 are shown in Figs. 5 and 2 in light blue. In the latter, we also depict the area corresponding to Δχ² = −60 in brown.

As already noted in Sec. III A, the NNPDF4.0 analysis reports a difference of about −340 units (for 4618 data points) between the total χ² values of the central replica in the experimental and t0 definitions. The two definitions differ in their implementations of the experimental correlated systematic errors, an issue that was scrutinized in several past studies [see, e.g., [37, 42-44]]. Other definitions exist in the literature, each providing an approximation to the full correlation model, and with some expected to be more resistant to biases when fitting data with relative systematic errors. In particular, the default "extended T" definition used by CT applies the multiplicative treatment to all correlated errors, not only to the normalization error, as is done in the t0 definition. Figure 6 compares the scans of the χ² in the experimental and t0 definitions, and Fig. 7 compares the distributions of our replicas satisfying −35 ≤ Δχ² ≤ 0 in the experimental and t0 definitions in the {σ_H, σ_Z} plane. The t0 definition in Fig. 6 is also consistent with an approximately quadratic behavior; its minima along individual EV directions are closer on average to the central replica than with the experimental definition. Nevertheless, substantial shifts persist along some EV directions, notably EV direction 1, associated with the small-x gluon PDF. In Fig. 7, many replicas with −35 ≤ Δχ² ≤ 0 in the t0 definition lie outside the nominal NNPDF4.0 uncertainty, even though the difference from the nominal uncertainty is smaller in this case than with the experimental definition. Altogether, the hopscotch exercise demonstrates the degree to which predictions for LHC cross sections depend on the quality criteria for individual fits as well as on the sampling procedures adopted by the groups.
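An equivalent ellipse reconstruction can be sketched in Python by fitting a general conic to the convex-hull points of the replica cloud; the cloud below is synthetic and serves only to illustrate the procedure, not to reproduce the regions of Fig. 2.

# Fit an ellipse to the outermost points of a 2D replica cloud by taking the convex
# hull and solving the algebraic conic least-squares problem a u^2 + b uv + c v^2 + d u + e v = 1.
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(3)
cloud = rng.multivariate_normal([1.0, 1.0], [[1.0, 0.6], [0.6, 0.5]], size=200)  # synthetic pairs

hull = ConvexHull(cloud)
u, v = (cloud[hull.vertices] - cloud.mean(axis=0)).T        # outermost points, centered

A = np.column_stack([u**2, u * v, v**2, u, v])
coef, *_ = np.linalg.lstsq(A, np.ones_like(u), rcond=None)
a, b, c, d, e = coef
is_ellipse = b**2 - 4.0 * a * c < 0.0                       # discriminant test for an ellipse
print(f"{len(u)} hull points used, conic is an ellipse: {is_ellipse}")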
To the question "Which of the generated replicas are acceptable to predict the LHC cross sections?", the answer of the exercise is "Apparently, all of them that have good χ²." Indeed, upon a closer examination, the generated alternative PDFs for Δχ² ≤ 0 with either definition appear to pass the standard validation adopted in the CT fits. They are linear combinations of well-behaving Hessian sets and are sufficiently smooth and positive in the x region constrained by the data. At Q = 2 GeV, only a few of them are negative in the extrapolation regions, where their behavior can be easily adjusted without changing the agreement with the data. We have not systematically scrutinized the integrability of T_3 and T_8, as done in the NNPDF4.0 fit, yet we observed no compelling reason to discard the alternative solutions. We have emphasized that the distribution of replicas with good χ² depends on the χ² definition. This dependence cannot be neglected at the contemporary accuracy level. It is possible that a deeper examination will reveal reasons why such replicas are disfavored by the nominal NNPDF4.0 uncertainty. Without knowing the disqualifying reasons, theoretical predictions based on the alternative solutions are acceptable on the same footing as the responses of individuals in a population survey. We make LHAPDF6 grids of the alternative PDF replicas available for future analyses [45].

PDF uncertainties in high-stake measurements (Higgs cross sections, the W boson mass, ...) should be examined for the robustness of results to the sampling of available experimental data sets and PDF parametrizations. Sampling biases may arise in PDF fits operating with large populations of possible solutions. Increasing the volume of the fitted data and of the parametric space may increase, not reduce, the sample expectation deviation. An undetected deviation may result in a wrong prediction with a low nominal uncertainty. Sampling biases may limit the reduction of the PDF uncertainties and explain some differences between the PDF sets. For these reasons, global fits are potentially vulnerable to unrepresentative sampling when their overall scope (including the number of PDF parameters, the size of data sets, and the range of possible assumptions) grows.

As a way to mitigate the risk of underestimation in specific applications, the statistical literature suggests swapping democratic sampling in all dimensions for preferential sampling in the fewer dimensions that are most relevant to the task at hand. In the Monte-Carlo (MC) replica method, constructing the Hessian eigenvector (EV) sets from the MC PDF set introduces a convenient coordinate system for such dimensionality reduction. Taking the W boson mass measurements as an example, we could identify the few Hessian sets that give the largest contribution to the M_W PDF error. It is then more effective to sample these EV directions with a higher density of replicas to look for acceptable PDFs that may be outside of the nominal MC uncertainty. We presented the technique of hopscotch scans to perform such focused sampling. Furthermore, from the examination of the alternative NNPDF4.0 replicas, we find that the dependence of the distribution of acceptable predictions on the prescription for the implementation of experimental systematic errors cannot be neglected at the targeted level of accuracy. Similar dependence has been observed in the CT fits (see, e.g., Sec. 6D in [44]) and should be examined as a part of the total uncertainty.
In either the MC or Hessian methods, a comprehensive range of fits must be explored to understand the variations due to the functional forms and other choices. This viewpoint is taken in the CTEQ-TEA family of analyses, in which the tolerance on the fixed PDF functional form of the published set is selected so as to cover the candidate best-fit PDFs found with the alternative choices. In other words, one must pay attention both to the quality of the accepted fits and to their representative sampling. For example, when some experiments disagree, it should be understood that fitting all experiments at once will either fail the strong goodness-of-fit test [17] or, if such a fit is nevertheless accepted, the tolerance may need to be increased, as the experimental tensions lead to a larger uncertainty on the full population. Instead of considering a large population of N_p acceptable solutions, for specific predictions the trio identity (3) can help to design a procedure that produces unbiased and reliable estimates using a sample of a smaller size N_s ≪ N_p. The overall spirit of this approach is similar to data set diagonalization [29] and replica unweighting [46, 47]. The R mechanism realizes a generalization of such techniques and can select fits based on the value of χ² or other figures of merit.

... Department of Energy under grant No. DE-SC0007914, the U.S. National Science Foundation under Grant No. PHY-2112829, and in part by the PITT PACC. The work of CPY is partially supported by the U.S. National Science Foundation under Grant No. PHY-2013791. CPY is also grateful for the support from the Wu-Ki Tung endowed chair in particle physics.

In this appendix, we summarize the settings of the computations of the LHC cross sections shown in the main part of the article. The cross sections are computed at NNLO in the QCD coupling strength without cuts, unless specified otherwise.

Drell-Yan W±/Z production. For W±/Z boson production at the Tevatron 1.96 TeV, we impose the CDF fiducial cuts [16],

W±: 30 < p_T^{ℓ,ν} < 55 GeV, |η_ℓ| < 1, u_T < 15 GeV, 60 < m_T < 100 GeV;   (A1)
Z: 30 < p_T^ℓ < 55 GeV, |η_ℓ| < 1, u_T < 15 GeV, 66 < m_{ℓℓ̄} < 116 GeV,   (A2)

where u_T is the transverse momentum of the hadronic recoil and m_T is the transverse mass of the charged lepton-neutrino pair. For W/Z boson production at the LHC, we adopt the ATLAS 13 TeV fiducial cuts [48],

W±: p_T^{ℓ,ν} > 25 GeV, |η_ℓ| < 2.5, m_T > 50 GeV;   (A4)
Z: p_T^ℓ > 25 GeV, |η_ℓ| < 2.5, 66 < m_{ℓℓ̄} < 116 GeV.   (A5)

(Both sets of cuts are also encoded in the short sketch at the end of this appendix.) The theoretical calculation is performed with the fast computation tables of APPLgrid [49] at NLO, combined with NNLO/NLO point-by-point K-factors calculated with MCFM [50, 51]. The renormalization and factorization scales are set equal to the invariant mass of the lepton pair, m_{ℓℓ̄} or m_{ℓν}.

Top-quark pair production. Top-quark pair production is measured by both the ATLAS and CMS collaborations at 13 TeV [52, 53] and presented in the form of total cross sections. Here we take the public code top++ [54] to compute these cross sections at NNLO, with the threshold logarithms of soft gluons resummed up to the NNLL level. The factorization and renormalization scales are set to the top-quark mass m_t.

Higgs boson production. The calculation is done with ggHiggs [55], using the factorization and renormalization scales equal to m_H.

Associated production of Higgs bosons and top-quark pairs. Recently, a part of the NNLO calculation for tt̄H production appeared [56], but no public code has been released yet. Instead, we make predictions using MadGraph5_aMC@NLO [57] interfaced with PineAPPL [58] at NLO, using NNLO PDFs. The renormalization and factorization scales are set equal to the partonic collision energy √ŝ.
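For concreteness, the fiducial selections above can be expressed as simple kinematic conditions. The sketch below is illustrative only; the function and variable names are assumptions, and the transverse mass, hadronic recoil, and dilepton mass are assumed to be computed beforehand.

# Minimal encoding of the fiducial cuts listed above (CDF 1.96 TeV and ATLAS 13 TeV).
def pass_cdf_w(pt_l, pt_nu, eta_l, u_t, m_t):
    return (30.0 < pt_l < 55.0 and 30.0 < pt_nu < 55.0 and abs(eta_l) < 1.0
            and u_t < 15.0 and 60.0 < m_t < 100.0)

def pass_cdf_z(pt_l1, pt_l2, eta_l1, eta_l2, u_t, m_ll):
    return (all(30.0 < pt < 55.0 for pt in (pt_l1, pt_l2))
            and all(abs(eta) < 1.0 for eta in (eta_l1, eta_l2))
            and u_t < 15.0 and 66.0 < m_ll < 116.0)

def pass_atlas_w(pt_l, pt_nu, eta_l, m_t):
    return pt_l > 25.0 and pt_nu > 25.0 and abs(eta_l) < 2.5 and m_t > 50.0

def pass_atlas_z(pt_l1, pt_l2, eta_l1, eta_l2, m_ll):
    return (pt_l1 > 25.0 and pt_l2 > 25.0 and abs(eta_l1) < 2.5 and abs(eta_l2) < 2.5
            and 66.0 < m_ll < 116.0)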
We appreciate related investigations of CT18, MSHT'20, and NNPDF3.

References

[1] Combination of measurements of inclusive deep inelastic e±p scattering cross sections and QCD analysis of HERA data.
[2] Constraints on large-x parton distributions from new weak boson production and deep-inelastic scattering data.
[3] Parton distribution functions, αs, and heavy-quark masses for LHC Run II.
[4] Parton distributions from high-precision collider data.
[5] New CTEQ global analysis of quantum chromodynamics with high-precision data from the LHC.
[6] Differential Top Quark Pair Production at the LHC: Challenges for PDF Fits.
[7] Parton distributions from LHC, HERA, Tevatron and fixed target data: MSHT20 PDFs.
[8] The path to proton structure at 1% accuracy.
[9] Determination of the parton distribution functions of the proton using diverse ATLAS data from pp collisions at √s = 7, 8 and 13 TeV.
[10] HERAFitter.
[11] Snowmass 2021 whitepaper: Proton structure at the precision frontier.
[12] The PDF4LHC21 combination of global PDF fits for the LHC Run III.
[13] PDF4LHC recommendations for LHC Run II.
[14] Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election.
[15] Unrepresentative big surveys significantly overestimated US vaccine uptake.
[16] High-precision measurement of the W boson mass with the CDF II detector.
[17] Hadronic structure in high-energy collisions.
[18] The trio identity for Quasi-Monte Carlo error.
[19] Correlation and combination of sets of parton distributions.
[20] Presentations on studies of data-driven correlations in PDF fits.
[21] A first determination of parton distributions with theoretical uncertainties.
[22] Parton Distributions with Theory Uncertainties: General Formalism and First Phenomenological Studies.
[23] Correlation of theoretical uncertainties in PDF fits and theoretical uncertainties in predictions.
[24] Why αs cannot be determined from hadronic processes without simultaneously determining the parton distributions.
[25] An intractability result for multiple integration.
[26] Valuation of mortgage-backed securities using Brownian bridges to reduce effective dimension.
[27] Estimating mean dimensionality of analysis of variance decompositions.
[28] When are Quasi-Monte Carlo algorithms efficient for high dimensional integrals?
[29] Data set diagonalization in a global fit.
[30] Parton distributions for the LHC Run II.
[31] Uncertainties of predictions from parton distribution functions. 2. The Hessian method.
[32] Precision measurement and interpretation of inclusive W+, W− and Z/γ* production cross sections with the ATLAS detector.
[33] LHAPDF6 library of parton distribution functions.
[34] An open-source machine learning framework for global analyses of parton distributions.
[35] The NNPDF open source code.
[36] Uncertainties of predictions from parton distribution functions. 1. The Lagrange multiplier method.
[37] Parton Distribution Benchmarking with LHC Data.
[38] A meta-analysis of parton distribution functions.
[39] Reconstruction of Monte Carlo replicas from Hessian parton distributions.
[40] The mcgen program for generation of Monte-Carlo replicas and algebraic manipulations with LHAPDF6 tables.
[41] Direct ellipsoidal fitting of discrete multi-dimensional data.
[42] New generation of parton distributions with uncertainties from global QCD analysis.
[43] Fitting Parton Distribution Data with Multiplicative Normalization Uncertainties.
[44] CT10 next-to-next-to-leading order global analysis of QCD.
[46] Reweighting NNPDFs: the W lepton asymmetry.
[47] Reweighting and Unweighting of Parton Distributions and the LHC W lepton asymmetry data.
[48] Measurement of W± and Z-boson production cross sections in pp collisions at √s = 13 TeV with the ATLAS detector.
[49] A posteriori inclusion of parton density functions in NLO QCD final-state calculations at hadron colliders: The APPLGRID Project.
[50] A Multi-Threaded Version of MCFM.
[51] Precision Phenomenology with MCFM.
[52] Measurements of top-quark pair to Z-boson cross-section ratios at √s = 13, 8, 7 TeV with the ATLAS detector.
[53] Measurement of the tt̄ production cross section, the top quark mass, and the strong coupling constant using dilepton events in pp collisions at √s = 13 TeV.
[54] Top++: A Program for the Calculation of the Top-Pair Cross-Section at Hadron Colliders.
[55] On the Higgs cross section at N3LO+N3LL and its uncertainty.
[56] tt̄H production at NNLO: the flavour off-diagonal channels.
[57] The automation of next-to-leading order electroweak calculations.
[58] PineAPPL: combining EW and QCD corrections for fast evaluation of LHC processes.