key: cord-0118270-g7dlgfvo
authors: Liu, Zhan; Valliant, Richard
title: Investigating an Alternative for Estimation from a Nonprobability Sample: Matching plus Calibration
date: 2021-12-01
journal: nan
DOI: nan
sha: 06a84bcc715c4ff56f1f853636bdcfa75440e8cf
doc_id: 118270
cord_uid: g7dlgfvo

Matching a nonprobability sample to a probability sample is one strategy both for selecting the nonprobability units and for weighting them. This approach has been employed in the past to select subsamples of persons from a large panel of volunteers. One method of weighting, introduced here, is to assign a unit in the nonprobability sample the weight from its matched case in the probability sample. The properties of resulting estimators depend on whether the probability sample weights are inverses of selection probabilities or are calibrated. In addition, imperfect matching can cause estimates from the matched sample to be biased so that its weights need to be adjusted, especially when the size of the volunteer panel is small. Calibration weighting combined with matching is one approach to correcting bias and reducing variances. We explore the theoretical properties of the matched and matched, calibrated estimators with respect to a quasirandomization distribution that is assumed to describe how units in the nonprobability sample are observed, a superpopulation model for analysis variables collected in the nonprobability sample, and the randomization distribution for the probability sample. Numerical studies using simulated and real data from the 2015 US Behavioral Risk Factor Surveillance Survey are conducted to examine the performance of the alternative estimators.

Probability samples have been the standard for finite population estimation for many decades.

However, probability samples can have many nonsampling problems like low contact and response rates or missing data for units that do respond. Response rates in US surveys, in particular, have been declining for at least two decades (Brick and Williams, 2013) . Since nonprobability samples can be faster and cheaper to administer and collect, some organizations are gravitating toward their use (Terhanian and Bremer, 2012) . Baker et al. (2013) review the reasons that nonprobability samples, like volunteer internet panels, may be used rather than a probability sample. Among them are lower costs and compressed data collection periods. Quick turnaround can be especially important to gauge public well-being in health crises like the COVID-19 pandemic of 2020.

There are a variety of problems with nonprobability samples, especially among persons in a panel that have been recruited to participate in future surveys (e.g., see Baker et al., 2013; Valliant and Dever, 2011; Valliant et al., 2018) . These include selection bias, coverage error, panel nonresponse, attrition, and measurement error. We concentrate on the use of matching and calibration to adjust for the first two of these-selection bias and coverage error. Selection bias occurs if the sample differs from the nonsample in such a way that the sample cannot be projected to the full population without some type of statistical adjustment. Coverage error can occur if, for example, a volunteer panel consists of only persons with access to the Internet, assuming that the entire population of a country is the target of the survey. Other, more subtle forms of coverage error can occur if certain demographic groups would rarely or never participate in the particular type of nonprobability survey being conducted.

Because the selection of a nonprobability sample is not controlled by a survey designer, estimation methods other than standard design-based approaches are needed. At least six alternatives can be considered for weighting and estimation with nonprobability samples:

(1) Naïve method where all units are assigned the same weight

(2) Quasi-randomization where a pseudo-inclusion probability is estimated for each nonprobability unit (3) Superpopulation modeling of analysis variables (Y 's) (4) Doubly robust estimation where quasi-randomization and superpopulation modeling are combined (5) Mass imputation of Y 's into a probability sample using values from a nonprobability sample to form an imputation model (6) Matching of a nonprobability sample to a probability sample whose units are used as donors of weights to the nonprobability sample

The naïve method of equal weighting is rarely, if ever, appropriate because nonprobability samples are not generally distributed proportionally across demographic or other important groups in the population. Alternatives (2)-(4) were reviewed by Elliott and Valliant (2017) and Valliant (2020) and further studied by Chen et al. (2020) . Wang et al. (2020) refined alternative (2) by kernel-smoothing the propensity weights. Alternative (5) was proposed by Kim et al. (2021) and

involves fitting an imputation model using data from the nonprobability sample and imputing Y values to the units in the probability sample using that model. The probability sample with imputed values is provided to analysts but not the nonprobability sample. Mass imputation solves the weighting problem by using the weights associated with the probability sample. The dissertation of Wang (2020) studied a version of (6) in which a kernel-smoothing method was used to proportionally distribute the probability sample weights to units in the nonprobability sample. We study another, somewhat simpler version of alternative (6), and particularly address some problems with the method.

Both a probability sample, denoted by S p and a nonprobability sample, denoted by S np will be used in subsequent sections. The target population for which estimates are made is U and has N units. To examine properties of estimators, three distributions will be used. Expectations taken with respect to the sample design used to select the probability sample S p will be denoted by a π subscript. The probability of selection of unit i in S p is π i . To analyze the nonprobability sample S np , we assume that its units are selected by an unknown quasi-randomization distribution; expectations taken with respect to that distribution will be specified by a subscript R. The probability that unit j is included in S np is R(x j ) where x j is a C-vector of covariates or auxiliaries associated with unit j. To simplify notation in later sections, we set R(x j ) ≡ R j . The analysis variable Y will also be assumed to be generated by a superpopulation model, ξ. Consider the linear model for Y i defined by

(1.1)

where β is a C × 1 parameter vector, and the ǫ i are independent, random errors with mean zero and variance σ 2 i . Theory for nonlinear models can also be worked out, as in Chen et al. (2020) , but a linear model is convenient for purposes here. Under model (1.1), the expected value of the population total,

The remainder of the article is organized as follows. Section 2 describes how matching can be applied to obtain basic weights for units in the nonprobability sample and reviews the methods of matching. Section 3 presents the theory for bias and variance in different situations. Section 4 investigates properties of matched estimators when the nonprobability sample is calibrated to population totals of covariates. In Section 5, the sample matching and the calibration adjustment are applied in a simulation study using artificial data. In Section 6, an application to a real population is conducted to evaluate the performance of the proposed estimates. The last section summarizes our findings.

Sample matching has been an option for estimating treatment effects in causal inference for some time (e.g., see Cochran, 1953; Rubin, 1973; Rosenbaum and Rubin, 1983) . Moreover, it has been widely applied in evaluation research, observational studies and epidemiological studies (Rothman et al., 2008) . More recently, it also has been applied as a way of identifying a sample in market research, public opinion surveys (e.g., Vavreck and Rivers, 2008; Terhanian and Bremer, 2012) , and other nonprobability sampling surveys, especially using volunteer panel surveys. Baker et al. (2013) review some of the applications of matching in survey sampling. Its purpose in nonprobability sampling surveys is to reduce selection bias and to estimate population characteristics. Another application of statistical matching is to overcome the problem of missing data created when some persons do not consent to having their survey responses linked to administrative databases (Gessendorfer et al., 2018) .

The basic idea of sample matching in survey sampling is that first a random, probability sample, S p , is selected from the sampling frame of the target population. This probability sample is matched to a pool of nonprobability cases, e.g., a volunteer panel of persons. The resulting matched sample from the nonprobability pool is denoted by S np . The probability sample should have none of the coverage problems of the nonprobability sample. This probability sample is also called a reference sample (Lee, 2006) and can be an existing survey (or subsample of one) rather than one specially conducted to serve as the reference. For example, in the US the American Community Survey (ACS, https://www.census.gov/programs-surveys/acs) is one possibility for a large, well-conducted household, reference survey. The probability sample should be representative of the target population in the sense that it can be used to make unbiased and/or consistent estimates of population quantities. We assume that S p does not include the Y variables for which estimates are to be made; these are collected from the nonprobability sample.

The application of matching described by Rivers (2007) is one in which S p is a simple random sample (srs). The nonprobability sample S np is obtained by a one-to-one match of S p to a much larger pool of nonprobability cases, yielding a set S np of the same size as S p . Since S p was treated as an srs, every unit in S np was given the same weight. When S p is an srs, the distribution across various characteristics of S np is expected to be the same as that of the population. However, in an evaluation of the nonprobability samples offered by nine commercial vendors, Kennedy et al. (2016) found that a nonprobability sample may still produce biased estimators even though it had the same demographic distribution as the population. In other words, matching to an srs S p to obtain S np can be inadequate without further weighting. Rivers and Bailey (2009) describe an election polling application where the sample was obtained by matching, as described above, but inverses of estimated propensities of being in the nonprobability sample were used as weights.

Sample matching in alternative (6), as applied in this paper, fits into the quasi-randomization approach. Each unit in a probability sample is matched to a unit in the nonprobability sample based on a set of covariates. The logical extension of Rivers (2007) is for the probability sample unit to "donate" its weight to the matched, nonprobability sample unit. The intuitive argument to justify this is that if the nonprobability units match the probability units on an extensive list of covariates, then the S np units are exchangeable for the S p units, S np constitutes the same sort of sample as S p , and the units in S np can be weighted in the same way. This approach has the advantage of straightforward retention of all analytic data collected in the nonprobability sample unlike alternative (5) which could require a separate imputation model for every Y variable.

The probability sample used for matching can be larger, smaller, or equal in size to the nonprobability sample, although the method in which S np is selected to have the same size as S p has advantages. If a pool of nonprobability units is used that is much larger than the probability sample, finding a close match for each unit in the probability sample may be more feasible. This would be the case when a large panel of volunteers has been accumulated. If the nonprobability sample is smaller, a unit in S np may be matched to more than one unit in S p , making it unclear how to weight the S np cases. In this article, we assume that the resulting sample size of the matched, nonprobability sample, S np , equals the sample size of the probability sample, S p , that it is matched against. Denote this sample size by n.

Which matching algorithm to use is a question. There are various algorithms, including nearest neighbour matching, caliper and radius matching, stratification and interval matching, as well as kernel and local linear matching (Caliendo and Kopeinig, 2008 (Dehejia and Wahba, 2002) . To overcome this problem, single nearest neighbour matching with replacement and multiple nearest neighbour matching were proposed to increase the average quality of matching and reduce the bias (Smith and Todd, 2005) . Other matching methods have been suggested that use more than one unit in the nonprobability pool as the matching unit for an individual in the probability sample. Caliper and radius matching use this approach.

Another issue in sample matching is that the matching process will become relatively more difficult as the number of relevant covariates increases. This is the curse of dimensionality noted by Rosenbaum and Rubin (1983) . In order to solve this problem, they propose the propensity score, which is the conditional probability of receiving a treatment given the covariates X, denoted by p(X) = P (D = 1|X), where D is the binary indicator taking either the value 1 (receiving treatment, e.g. participation in a volunteer panel) or 0 (not receiving treatment). Rosenbaum and Rubin (1983) have proved that matching on the propensity score p(X) is also valid when it is valid to match on the covariates X. Compared with direct matching based on all covariates, propensity score matching can reduce multiple dimensions (many covariates) to a single dimension, greatly simplifying the matching process. Consequently, it has been widely used in medical and epidemiological studies, economics, market research and a host of other fields (Schonlau et al., 2009; Baker et al., 2013) .

In this section, we introduce estimators of means and totals based on the matched sample S np and derive their properties. An estimator of a population total is

where w j is the weight from the probability sample unit that is matched to unit j in S np and y j is the Y value observed for that unit. We assume that these weights are appropriately scaled for estimating population totals. In particular,N M = j∈Snp w j is an estimator of N , the size of the target population. The mean of Y is estimated by Y M = Y M / N M . We will consider two cases of weighting of the probability sample S p :

(i) The weight used for each unit in S p is the inverse of the selection probability of that unit, i.e., w j = π −1 j ; the estimator of the total with this weight is denoted byŶ M 1 subsequently;

(ii) The S p weights are those for a general regression (GREG) estimator; the estimator with this weight is denoted byŶ M 2 .

Note that the GREG in case (ii) includes the commonly used poststratification estimator. Whether those weights are related to the pseudo-inclusion probabilities of the units in S np largely determines whether Y M 1 and Y M 2 are biased or not as shown below. The arguments given are largely heuristic, although they can be formalized using technical conditions like those in Chen et al. (2020) .

Properties of estimators can be calculated in several ways: with respect to the ξ-model only,

with respect to the R-distribution only, with respect to the π-distribution, or with respect to a combination of the distributions. In subsequent sections, we compute biases and variances with respect to the combined Rπ-distribution. The Rπ calculation is analogous to the design-based calculations used in much of sampling theory. In addition, bias and variance calculations are made with respect to the ξ-model and combined Rπξ-distributions. The calculations made with respect to the ξ-distribution are conditional on the S np and S p samples. In principle, ξ calculations are more reflective of the statistical properties for the particular sets of units in S np and S p .

Taking the expectation of Y M 1 − Y U under case (i) with respect to the pseudo-randomization distribution only gives

If R j = π j , then Y M will be R-unbiased. However, this does not have to be true generally. For example, if R j = P r (j ∈ S np | x j ) is a complicated logistic function of a set of covariates that were not used in determining w j , the estimator is R-biased. Another situation leading to Rbias would be when the pseudo-inclusion mechanism is nonignorable, i.e., P r (j ∈ S np | x j , y j ) = P r (j ∈ S np | x j ). Since in a probability sample, the selection mechanism is always ignorable, w j = 1/P r (j ∈ S np | x j , y j ) when inclusion in the nonprobability sample depends on Y .

If the expectation is taken over the Y -model, the result is

The ξ-bias is non-zero unless X np (π) = X U . If X np (π) is an unbiased estimator of X U under the quasi-randomization R-distribution, Y M 1 will be unbiased when averaged over both the R-and ξ-distributions (and, equivalently, over the R, π, and ξ distributions). But, as for Y M 1 , X np (π) will be biased if the correct R-model is not linked to the S p weights, i.e., if

If the weights from the probability sample have been calibrated to population totals of some covariates x, the bias calculation changes somewhat. Take the case of the general regression (GREG) estimator being used for S p . That is, w j = g j /π j where

The values of σ 2 j are often set to a constant in practice, but for completeness, we include σ 2 j in subsequent formulas. If σ 2 j 's are used in estimators of totals, they will be generally be assumed values of the model variances in (1.1); but, we do not require that σ 2 j = σ 2 j . Note also that the π j 's must be available separately for each unit in the probability sample in order to recover A p separately from the w j . In some public-use files, users may only be presented with the w j = g j /π j and not π j .

The estimator of the total is then

where Y np (π) = j∈Snp y j /π j . The ξ-bias is

If R j = π j and sampling for S np and S p is ignorable, reasonable assumptions are that N A −1

Taking the expectation of (3.4) with respect to the R-and π-distributions shows that Y M 2 is approximately Rπξ-unbiased, but this depends on R j = π j for all units in S np . Under the same conditions (i.e., R j = π j and N A −1 p and N −1 A np (π) converging),

The results in sections 3.1 and 3.2 can be summarized as follows:

• Case (i), w j is the inverse of the selection probability for its matched unit in the probability

i.e., the probability of a unit's being in the nonprobability sample, S np , equals its probability of being in the probability sample, S p ;

-Y M 1 is Rπξ-unbiased when the linear model (1.1) holds and R j = π j ;

The key requirement (in addition to ignorability) for unbiasedness of any type is that the observation probability of a unit in the nonprobability sample should be equal to the selection probability of its matched unit from the probability sample. This seems unlikely to be exactly true in most applications.

Since a variance estimator is mainly useful in a situation where a point estimator is unbiased or consistent, we concentrate on the case where R j = π j and Y M 1 is R-unbiased. Calculation of the variance of Y M 1 with respect to the pseudo-inclusion probability distribution depends on the joint distribution of the indicators, {δ j } j∈U where δ j = 1 if j ∈ S np and 0 if not. If the δ j have the same joint distribution as that of the indicators for being in the probability sample, then S np can be treated as having the same sample design as S p . If so, V R Y M 1 = V π Y M 1 , and the variance estimator for Y M 1 would be determined by the sample design for S p . For example, if the probability sample was a stratified, cluster sample, then the variance estimator appropriate to that design would be used.

When w j = π −1 j , a more realistic assumption, given the way that nonprobability samples are often acquired, is to treat the {δ j } j∈U as being independent. With that assumption, the R-variance

can be estimated with a formula appropriate for a Poisson sample. Another option is the formula for a sample selected with replacement and with probabilities equal to

and (3.5) can be interpreted as an estimator of either. Estimator (3.5) is convenient because it is the default in survey software packages like R survey, Stata, and SAS. However, as shown in Appendix A.1, v Rπ is a biased estimator of the model variance given below. and White, 1985) . Note that, because the Y 's are not available in the probability sample, we must estimate β from the nonprobability sample.

The Rπξ-variance, in general, is equal to

As shown in Appendix A.1, for case (i) with R j = π j , this is

Notice that, even thoughŶ M 1 does not directly depend on x, the Rπξ-variance does after accounting for the ξ-model structure. Expression (3.8) can be estimated by

where v R ( X np ) is, for example, a version of (3.5) adapted to estimate a covariance matrix.

The

(3.10)

As noted in Appendix A.2, the estimator of total can be approximated by

and the approximate Rπ variance is

Note that, in the situation studied here, both terms of the variance in (3.12) have the same order of magnitude, O(N 2 /n), since they are based on samples of the same size. Thus, the Rπ-variance is the variance in the nonprobability sample of the estimator with inverse pseudo-inclusion probability weights plus a term reflecting the variance of the estimator of the x-totals in the probability sample.

Since Y np (π) = Y M 1 , (3.12) also implies that the Rπ-variance of Y M 2 with calibrated S p weights is larger than that of the uncalibrated Y M 1 . This disagrees with the usual expectation that calibration on an effective predictor of Y reduces variances. To better understand this, note that if the matched

x's in S p and S np were identical, then X p = X np and the variable part of (3.11) could be written as a weighted sum over S np of residuals, which can then be used to show that Y M 2 can have a smaller variance than Y M 1 . However, with imperfect matching the relationship in (3.12) becomes more realistic.

(3.14)

A natural estimator of (3.14) is then

Consequently, there are several options for variance estimation for Y M 2 for cases (i) and (ii).

They can be summarized as:

-Estimate the quasi-randomization (Rπ) variance with the with-replacement estimator in (3.5);

-Estimate the Rπξ-model variance with v Rπξ in (3.9);

• Case (ii), w j = GREG weight from S p -Estimate the ξ-variance with (3.10)

-Estimate the Rπ-variance with (3.13);

-Estimate the Rπξ-variance with (3.15);

The R−, Rπ−, or Rπξ-bias of the matched estimators, Y M 1 and Y M 2 , in section 3 depend critically on whether P r (j ∈ S np | x j ) = P r (i ∈ S p ) for matched units i and j. Matching on covariates attempts to ensure this; however, there is no guarantee that the condition is satisfied regardless of how extensive the set of covariates is.

Consequently, one might hope that calibrating the weights for the nonprobability sample will provide some bias protection. Suppose that the { w j } j∈Snp weights are calibrated to the X U population totals using the chi-square distance function associated with a GREG. Using the standard formula from Särndal et al. (1992, , eqn. 6.5 .10), the resulting weight for unit j is

(Note that σ * 2 j does not have to be the same as σ 2 j used in constructing the GREG weight in S p .) As in section 3, σ * 2 j is often set to a constant in which case it drops out of the formula for w * j . The matched, calibrated estimator is

Snp w j x j y j /σ * 2 j . As in section 3, calculations depend on cases (i) and (ii) of the w j weights. When case (i) weights are used from S p , the calibrated estimator will be denoted by Y M C1 ; when case (ii) weights are used, Y M C2 denotes the calibrated estimator in subsequent sections.

When w j = π −1 j , X M = X np (π) and, after calibration, the estimator of the total can be written as

on the x's in the ξ-model yields an ξ-unbiased estimator even if R j = π j .

To calculate the Rπ-expectation, define B

By the same type of Taylor series argument as in Särndal et al. (1992, sec.6.5) ,

In case (ii) with w j = g j /π j and g j defined in (3.2), the matched estimator after calibration

As show in Appendix A.3, the calibrated estimator of the total is approximately

This bias occurs even though the nonprobability sample is calibrated on the x's in the model for Y .

The bias results for the matched, calibrated estimators Y M C1 and Y M C2 can be summarized as follows:

• Case (i), w j = π −1 j and the w j are then calibrated to population x-totals

• Case (ii), w j is the GREG weight for its matched unit in S p and the w j are then calibrated

to population x-totals -Y M C2 is ξ-biased even if (1.1) holds and the nonprobability sample S np is calibrated on the x's in the model;

If case (i) holds where the weights assigned to matched units are inverses of selection probabilities from S p , the situation is more straightforward than case (ii). R-unbiasedness in case (i) requires that the pseudo-inclusion probabilities can be taken from the probability sample, i.e., R j = π j .

Nonetheless, in case (i) calibrating the nonprobability sample does produce an ξ-unbiased estimator even if R j = π j , as one would hope. However, in case (ii) when the weights from the probability sample are calibrated and the nonprobability sample is further calibrated on the same x's, the resulting estimator is not ξ-unbiased.

To compute the ξ-model variance, note that the estimator of total can also be written as Y M C1 = Snp g * j y j /π j with g * j defined in (4.1) with w j = 1/π j . The ξ-variance is then

To compute the R-and Rπ-variance, we use the approximation in (4.3). Assume that R j = π j so that Y M C1 is R-unbiased. Based on results in section 4.1, the estimator can be approximated as

where e * j = y j − x T j B * U . The R-(and Rπ-) variance is, thus, equal to the variance of the first term in the last line of (4.6). If the sample S np is treated as being selected with replacement, then a variance estimator is

(4.7)

As shown in Appendix A.4, approximation (4.4) can be rewritten as

(4.8)

Rewriting (4.4), the calibrated estimator of the total is also

where e * j was defined above. Using the total variance formula, the Rπ-variance can be derived as

with v R Snp e * j π j being a variance estimator of an estimated total appropriate to how the nonprobability sample is handled. We use B np (π) in (4.10) rather than an estimator with w weights since the former is expected to be somewhat more stable. If S np is treated as being with-replacement, the first component in (4.10) can be computed with (4.7).

(4.11)

For each of the variance estimators above for the matched, calibrated estimator in cases (i) and

(ii), it is important to remember that unless R j = π j the estimator of total itself will be biased. If so, the mean square error will have a bias-squared component that none of the variance estimators will reflect.

In the combination above, both the weights in S p and those in S np are calibrated to a given set of x's. This is similar to the situation studied by Rao et al. (2002, p.368) , who noted that in a regression with calibration weights, GREG residuals are based on the regression of model residuals on X. If the model fits well, there will be very little association between those residuals and X leading to no gain compared to an estimator not using calibration weights. In our situation, when the estimators of totals are unbiased, we can expect Y M 2 with calibration in S p , Y M C1 with no calibration in S p and calibration in S np , and Y M C2 with calibration in both S p and S np to be about equally precise-a point borne out by the simulation in section 5.

To study the performance of the proposed estimators described above, we performed two simulation studies with an artificial population. In the first, conditions are created where close matches can be found between units in the probability sample and the nonprobability sample. In the second simulation, close matches are much less likely.

In the simulation, a finite population of size N = 100, 000 was based on the following model:

where α = 0.4, β = 0.25, σ 2 = 0.0625, and X follows a gamma distribution with density function f (x) = 0.04x exp(−x/5). This is the same model as used by Hansen et al. (1983) ; the function HMT in the R PracTools package was used to generate the population. Conditional on X, Y follows a gamma distribution with density function g(y; • Y M C1 , estimator (4.2) with 1/π weights from the matched units in S p followed by calibration in S np , and

• Y M C2 , estimator (4.2) with GREG weights from the matched units in S p followed by calibra-tion in S np .

The above process is repeated 5000 times. The percentage relative biases (relbiases), the variances and the mean squared errors of the matched estimator and the matched, calibrated estimator under cases (i) and (ii), are presented in Table 1 . The empirical percent relative bias is defined as

For comparison we included a doubly robust estimator, denoted byŶ DR , that was computed without matching. This estimator was computed in two steps as described in Elliott and Valliant (2017) . First, an equal probability subsample of n = 250 was selected from the volunteer panel of m = 1250. Then, S p and S np are combined. Units in S np are given a weight of 1 while units in S p were assigned their sampling weight of 1/π i . A logistic regression with X as the covariate was run to predict the probability of being in S np . The weight for unit j in S np was then calculated as w j = (1 −R j )/R j whereR j is the predicted probability of being in S np (see Wang et al., 2021) .

Without the odds transformation, the estimator would be somewhat biased (Chen et al., 2020) , but in this case the bias was negligible since S np is a small fraction of the population (Wang et al., 2021) . Finally, the estimator was calibrated with a model having an intercept and X. noteworthy is the fact that the doubly robust estimator,Ŷ DR , has a 21% larger MSE than the best of the matching estimates. This is a consequence of the logistic model used to estimate P r (j ∈ S np ) being a misspecification.

In addition to the point estimators, the variance estimators of the matched estimator and the matched, calibrated estimator under cases (i) and (ii) are also computed according to equations 

is a variance estimator of Y computed from the b th simulated sample, and B = 5000 is the total number of simulation runs. The percent relative biases (RB) and 95% confidence interval (CI) coverages using the normal approximation and the different variance estimates, are presented in Table 2 .

With three exceptions, the relbiases in Table 2 are small, ranging from -2.2% to 3.1%. An exception is v ξ ( Y M 1 ) which is a 15.8% underestimate due to the fact that it does not account for the variability ofX np as shown in section 3.3. The Rπ and Rπξ estimators for Y M 2 are about 22%

overestimates. As explained in Appendix A.2, these estimators will not fully account for precision gains due to calibration of weights in S p when the x-matches are extremely close. Confidence interval coverage ranges from 94.6% to 96.7% except for v ξ ( Y M 1 ) which covers in 92.3% of samples due to its underestimation. 

0.6 0.7 94.9

In this simulation, we consider a case in which R j = π j , j ∈ S np . The same finite population of size N = 100, 000 is used as in simulation study I along with a stratified, probability sample S p of size n = 250. A volunteer panel of expected size m = 1250 is selected from the population using Poisson sampling with selection probabilities π ′ i defined as follows:

With this definition of π i , the probability of being in S np decreases with increasing X. This kind of selection for the volunteer panel will generally result in R j = π i , for a unit j ∈ S np matched to a unit i ∈ S p .

As in simulation I, single nearest neighbor matching without replacement based on the variable X is adopted to conduct matching for the probability sample. The matched estimator, the matched, calibrated estimator and their variance estimators under cases (i) and (ii) are computed. The above procedure is repeated 5000 times. The relative biases, the variances and the mean squared errors are listed in Table 3 . Also, the same relative biases and 95% CI coverages of variance estimators as those in simulation study I are displayed in Table 4 .

In Table 3 

-5.9 -6.1 94.0

To further assess the performance of the matching estimators, they are applied to data obtained through telephone household interviews. The analytic variables Y in this study are whether respondents were ever diagnosed with a heart attack (CVDINFR4), were ever told by a medical professional that they have diabetes (DIABETE3), and were ever told they had a stroke (CVDSTRK3). Although each of these analysis variables is binary, use of linear estimators, as studied in previous sections, is standard survey practice, largely because of their convenience for data analysts.

Covariates associated with Y are sex, age, race, marital status, physical weight, employment status, education level, income level, whether respondents smoked at least 100 cigarettes in their entire life, and whether respondents participated in any physical activities or exercises in the past 30 days in 2015. All of the variables are shown in Table 5 .

After deleting cases with either a missing, a don't know or a refused response to any of these variables, 315,669 persons are available for this study. Two weights are provided with the dataset:

X WT2RAKE, which is a design weight and X LLCPWT, which is a raked, final weight. According to the documentation (https://www.cdc.gov/brfss/annual_data/2017/pdf/weighting-2017-508.pdf), BRFSS rakes the design weight to eight margins (gender by age group, race/ethnicity, education, marital status, tenure, gender by race/ethnicity, age group by race/ethnicity, and phone ownership). The raking also serves as a noncoverage/nonresponse adjustment. Because of the asymptotic equivalence of the GREG and raked estimators shown by Deville and Särndal (1992) , the earlier theory in sections 3 and 4 should apply to estimators based on X LLCPWT. In this dataset of 315,669 persons, 256,949 people who had used the internet in the past 30 days are considered as the web (nonprobability) subset. Using the X LLCPWT weights, the web population is only 84% (81% unweighted) of the target population, indicating that the effect of coverage error could be substantial. Moreover, the weighted distributions of the categorical covariates among all respondents in the web, non-web, and full populations are given in Table 6 . Categories of some variables are combined in Table 6 and in the simulation compared to the categories in Table 5 because they are small. Table 7 gives the proportions that reported a heart attack, diabetes, or a stroke in the web, non-web, and full populations. As shown in Tables 6 and 7 , there are differences between the target population and the web and non-web populations in the estimated distributions of some of the covariates. For example, 0.19 of the full population are age 65 or older, 0.14 of the web population are, and 0.45 of the non-web are 65+. In the target population, 0.59 are employed for wages, 0.65 are in the web population, but only 0.30 of the non-web are. About 8% of the web population have a grade 11 education or less while 13% of the full population does; 33% of the web population attended four or more years of college while 29% of the full population did. For the analysis variables in Table 7 , 4.3% of the target population have ever been diagnosed with a heart attack while 3.1% of the web population and 10.7%

of the non-web population have. Similar differences occur for diabetes and stroke. Although the percentage point differences are small between the web and full populations, the relative differences are substantial. For example, heart attacks in the web population are 72% (0.031/0.043) of those in the full population; diabetes in the web population is 80% of the full population rate; strokes in the web population are 72% of those in the full population. Consequently, calibrating the matched sample may reduce bias and variance as long as the covariates in Table 6 are predictive of the Y 's. However, it is clear that weighting a sample from the web population will have to achieve a considerable amount of bias correction in order to produce good estimates for the full, target population.

Also noteworthy are the substantial differences between the web and non-web subpopulations.

The non-web people are older, more likely to be Black and non-Hispanic, more likely to not be in the labor force, less educated, lower income, and more likely to have smoked than the web persons.

The non-web people are also much more likely to have had heart attacks, diabetes, and strokes. Our focus is on using a sample from the web population to make estimates for the full population, but any attempt to use a sample from the web population to represent the non-web population seems doomed to failure. In general, a nonprobability sample that has serious coverage problems cannot be expected to produce good estimates for poorly covered domains. To apply the proposed matching method, simple random samples are selected from the BRFSS web subsample and from the BRFSS full sample. Using equal probability sampling preserves any differences between the web and full samples and, in particular, any coverage defects in the web sample. The size of the S p probability sample was n = 500 while the size of the initial S np web sample was M = 3000. The BRFSS raked weights for persons in S p were adjusted to equal w j = (N/n) * X LLCPWT where N = 315, 669. Since the BRFSS design weights did not include a nonresponse adjustment and, consequently, did not sum to an estimate of the size of the target population, we computed a nonresponse-adjusted design weight for each person in S p asw πj =

The samples, S p and S np , are combined and the propensity of being in S p is estimated via logistic regression. The n closest matches in S np , found using the R package Matching, are retained for estimation. The matching reduces the size of S np to be the same (n = 500) as that of S p .

The weightsw j andw πj from the matching person in S p are assigned to person j in S np . These weights were used to calculate estimated proportions, (Lumley, 2020) .

We also computed two versions of the doubly robust estimator for comparison. The two alter-natives differed in the propensity model used. The first, Y DR1 , used a propensity model with the same covariates as the calibration model for Y M C1 and Y M C2 . The second, Y DR2 , used a propensity model that included an intercept, the interactions of INCOME2 with X AGE, EDUCA with X AGE, and INCOME2 with EDUCA. These interactions were determined from a regression tree analysis, and the covariates were recoded for the interactions to be binary. INCOME2 was recoded to less than or greater than or equal to $25,000; X AGE to less than 55 years or greater than or equal to 55 years;

EDUCA to less than high school or high school or more. The logistic propensity model for being in S np based on the merged dataset of S p and S np was estimated using the method described in Wang et al. (2021) . For both doubly robust alternatives, the same calibration model was used as

This process was repeated 5, 000 times for each of the three analysis variables. The relative biases, the variances and the mean squared errors (MSEs) of the three point estimators across the 5, 000 samples are summarized in Table 8 . For all three analysis variables the biases of Y M 1 and Y M 2 are positive, ranging from 4.8% for diabetes with M 1 to 15.7% for heart attack for M 2. Recall that M 1 is a type of π-estimator with the π-weight taken from the matched case in the probability sample. In this example, M 2 is a raked estimator with the weight being the raked weight from the matched case in S p . In contrast, the M C1, M C2, DR1, and DR2 estimators have serious negative biases, ranging from -21.6% to -17.5%. The ordering of the MSEs varies, although DR2 has the smallest MSE for two of the three analysis variables. None of the alternatives is able to correct for the undercoverage by the the web sample of the full population. Finally, as an experiment we also increased the sample sizes to n = 1000 for the nonprobability sample and M = 5000 for the initial probability sample. The increased sample sizes had no effect on the biases of the point estimates of means. (Results are omitted here.) 

In this article we present several alternative estimators when a nonprobability sample, S np , is matched to a probability sample, S p . The general setting is that the nonprobability sample is weighted by assigning the weight from an S p unit to its matched unit in the nonprobability sample.

Particular cases are (i) the weight from S p is its π-weight, (ii) the weight from S p is a GREG weight, (iii) case (i) with the nonprobability sample being calibrated with a linear model, and (iv) case (ii) with S np calibrated with a linear model. Under some restrictive conditions, these estimators can be approximately unbiased. The key requirement is that the actual propensity of a unit's being observed in the nonprobability sample should be equal to the inclusion probability of the unit that it is matched to in the probability sample.

Three simulation studies illustrated several points about the matched estimator and the doubly robust estimator, which is included for comparison. Study I used artificial data where the variable to be analyzed follows a linear model with a single covariate X, which was also used to create strata. The sample designs for both S p and S np were stratified simple random sampling with the design for S np treated as unknown. In this case, matching on X was reliable and all estimators were unbiased. In fact, three of four of the matching estimators had a smaller MSE than the doubly robust estimator.

The second simulation used the same artificial population and S p sample design as Study I, but

S np was selected with probabilities (treated as unknown) that decreased with X. In this example, the inclusion probabilities for the nonprobability sample are far from those in the probability sample used for matching. Consequently, the matched estimators without calibration are biased. However, calibration corrects the biases and the matched, calibrated estimator has a smaller MSE than the doubly robust estimator.

The third simulation used a real population (the US Behavioral Risk Factor Surveillance Survey, BRFSS) in which persons who had accessed the internet in the previous 30 days were treated as a nonprobability sample from the full US adult population. Since there was no control over how the nonprobability units were selected, this mirrored a situation that would be faced in practice. The prevalence of three health conditions was estimated. The prevalences differed considerably between the part of the population that was covered by S np and the part that was not. The persons who did not use the internet were older, less educated, lower income, and less healthy than the internet users. These differences led to all estimators in the study being biased. Calibrating the matching estimators on a list of covariates did not correct the biases. In addition, doubly robust estimation, which has been touted as one of the better options, had substantial biases that were larger then those of the best matching estimators.

The failure in the real data study has several, potential contributing factors, including poor matches between the nonprobability and probability units, inadequate models for the propensity of being observed in the nonprobability sample, and poor calibration models for predicting the health characteristics analyzed. However, the facts that the nonprobability sample does not cover the target population, and the noncovered units differ both on the distributions of the analytic variables and covariates is the critical problem. Some diagnostics have been devised for detecting non-ignorability of selection of a nonprobability sample (e.g., see Andridge et al., 2019; Little et al., 2019) . These diagnostics will signal non-ignorability if the means of covariates in S np and the target population are sufficiently different. Thus, they might be a way forward in the BRFSS application.

However, if the variables to be analyzed differ between S np and the target population but covariate distributions do not, the diagnostics will not alert an analyst to trouble, and poor inferences will still be made from the nonprobability sample. The type of coverage error in the BRFSS study is an example of what can happen in nonprobability samples, generally, and may be a problem that no amount of sophisticated mathematics is likely to correct.

This appendix shows the details of variance calculations given in earlier sections. Several assumptions are used in the results below. These apply as N and n → ∞.

(i) π j = O(N/n), R j = O(N/n) and n/N → 0

(v) When R j = π j , N −1Ã p and N −1Ã np (π) both converge in probability to

asymptotically multivariate normal with mean 0.

To compute the ξ-expectation of v R Y M 1 in section 3.3 under case (i), define r j = w j y j − 1 n j ′ ∈Snp w j ′ y j′ . Since w j = π −1 j , this can be rewritten as

The ξ-expectation of r 2 j is then

Adding and subtracting σ 2 j /π 2 j in the second term, summing over S np , and doing some algebra leads

as noted in section 3.3. That is, v Rπ is an overestimate of the model variance under (1.1). However, because Y M 1 is model-biased, v Rπ will not appropriately estimate the ξ mean square error despite its overestimating the ξ-variance.

To derive the Rπξ-variance, note that

Using the independence of the Y 's under (1.1), the first term is U σ 2 j /π j . The second term is

Combining gives the expression shown in (3.8).

A.2 Variance of Matched Estimator Y M 2 under Case (ii)

Following similar steps to those in Särndal et al. (1992, sec.6 .6) and using condition (vii), Y M 2 can be approximated as

Using the formula for total variance across the R and π distributions (denoted by V Rπ ) gives

Working term by term in (A.2) and using the approximation to Y M 2 in equation (A.1), we have

because Y np (π) has zero R-variance given that S np is fixed. To get the second term in (A.2), note

assuming that X p is π-unbiased. Combining these results, the variance across the R-and π-distributions is

as shown in (3.12).

Turning to the Rπξ-variance, the total variance formula is given by (3.7) . The E π V ξ term is Snpσ 2 j E π (g 2 j )/π 2 j . Using a Taylor series approximation as in Särndal et al. (1992, sec.6 .6), we have

It follows that

Thus,

Under the order assumptions at the beginning of this appendix, the first term above is O(N 2 /n) while the second is O(N 2 /n 2 ). Thus, we use the approximation

The second term in (3.7) is E R V π E ξ Y M 2 . Expanding and collecting terms gives

Under condition (v) above,Ã −1 pÃ np converges to the C × C identity matrix and E R V π E ξ Y M 2 = β T V π X p β.

The third term in (3.7) is V R E π E ξ Y M 2 . First, compute E π E ξ Y M 2 = E π Snp g j π j x j β . Using the approximation to g j in (A.3), E π (g j ) .

= 1 and E π E ξ Y M 2 . = X np (π)β. Consequently, the third term is V R E π E ξ Y M 2 . = β T V R X np (π) β. Combining results for the three terms in (3.7) gives V Rπξ Y M 2 . = U R j σ 2 j π 2 j + β T V π X p β + β T V R X np (π) β as shown in (3.14).

When S p has case (ii) weights, w j = g j /π j with g j defined in (3.2). The matched estimator after calibration then equals

Multiplying g * j by g j defined in (3.2) and substituting in the formula for Y M C2 gives To compute the ξ model variance under case (ii), we break U x j y j / σ 2 j and U x j y j /σ * 2 j into sums over S np and U − S np . Equation (A.6) can then be expressed as

Applying conditions (ii) and ( The Rπξ-variance can be calculated using the total variance formula in (3.7). First, when R j = π j , E R E π V ξ Y M C2 | S p , S np = U σ 2 j /π j . The second term in (3.7) is

The third term in (3.7) is

Rewriting the term in brackets above leads to

Applying condition (v) implies that V R E π E ξ Y M C2 . = 0. Combining results for the three terms in (3.7) yields

as shown in (4.11).

Indices of non-ignorable selection bias for proportions estimated from non-probability samples

Report of the AAPOR task force on non-probability sampling

Explaining rising nonresponse rates in cross-sectional surveys

Some practical guidance for the implementation of propensity score matching

Doubly robust inference with non-probability survey samples

Matching in analytical studies

Propensity-score matching methods for nonexperimental causal studies

Calibration estimators in survey sampling

Inference for nonprobability samples

Statistical matching as a supplement to record linkage: A valuable method to tackle nonconsent bias

An evaluation of model-dependent and probability sampling inferences in sample surveys

Evaluating online nonprobability surveys, vendor choice matters: widespread errors found for estimates based on blacks and hispanics

Combining non-probability and probability survey samples through mass imputation

Propensity score adjustment as a weighting scheme for volunteer panel web surveys

Measures of the degree of departure from ignorable sample selection

survey: analysis of complex survey samples

Some heteroskedasticity consistent covariance matrix estimators with improved finite sample properties

Estimating equations for the analysis of survey data using poststratification information

Sample matching for web surveys: Theory and application

Inference from matched samples in the 2008 U.S. national elections

The central role of the propensity score in observational studies for causal effects

Modern Epidemiology

Matching to remove bias in observational studies

Model Assisted Survey Sampling

Selection bias in web surveys and the use of propensity scores

Multivariate and propensity score matching software with automated balance optimization: The Matching package for R

Does matching overcome LaLonde's critique of nonexperimental estimators

A smarter way to select respondents for surveys?

Comparing alternatives for estimation from nonprobability samples

Estimating propensity adjustments for volunteer web surveys

Practical Tools for Designing and Weighting Survey Samples

PracTools: Tools for Designing and Weighting Survey Samples

The 2006 cooperative congressional election study

Improving External Validity of Epidemiologic Analyses by Incorporating Data from Population-Based Surveys

Improving external validity of epidemiologic cohort analyses: a kernel weighting approach

Adjusted logistic propensity weighting methods for population inference using nonprobability volunteer-based epidemiologic cohorts