key: cord-1036809-teax66vr
authors: Ghosh, Malay; Ghosh, Tamal; Hirose, Masayo Y.
title: Poisson Counts, Square Root Transformation and Small Area Estimation: Square Root Transformation
date: 2021-10-11
journal: Sankhya B (2008)
DOI: 10.1007/s13571-021-00269-8
sha: 2875140a8b215fe029cfa258a51de4ce98d8defd
doc_id: 1036809
cord_uid: teax66vr

The paper intends to serve two objectives. First, it revisits the celebrated Fay-Herriot model, but with homoscedastic known error variance. The motivation comes from an analysis of count data, in the present case, COVID-19 fatality for all counties in Florida. The Poisson model seems appropriate here, as is typical for rare events. An empirical Bayes (EB) approach is taken for estimation. However, unlike the conventional conjugate gamma or log-normal prior for the Poisson mean, here we apply a square root transformation to the original Poisson data, along with a square root transformation of the corresponding mean. Proper back transformation is used to make inferences about the original Poisson means. The square root transformation makes the normal approximation of the transformed data more justifiable, with the added benefit of homoscedasticity. We obtain exact analytical formulas for the bias and mean squared error of the proposed EB estimators. In addition to illustrating our method with the COVID-19 example, we evaluate the performance of our procedure with simulated data.

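Let $y_i$ denote the Poisson count for the $i$th area, with mean $\lambda_i$. Then, approximately, $z_i = \sqrt{y_i} \sim N(\theta_i, 1/4)$, where $\theta_i = \sqrt{\lambda_i}$ $(i = 1, \ldots, m)$.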
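The variance stabilizing effect of the square root can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `sqrt_poisson_variance`, the number of draws and the seed are illustrative assumptions. It estimates the variance of the square root of a Poisson count by Monte Carlo, which should stay close to 1/4 once the Poisson mean is moderately large.

```python
import numpy as np

def sqrt_poisson_variance(lam, n_draws=200_000, seed=0):
    """Monte Carlo estimate of Var(sqrt(y)) for y ~ Poisson(lam).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    y = rng.poisson(lam, size=n_draws)
    return float(np.var(np.sqrt(y)))

# The raw variance of y equals lam, while Var(sqrt(y)) settles near 1/4.
for lam in (1.0, 5.0, 20.0, 100.0):
    print(f"lambda={lam:6.1f}  Var(sqrt(y)) ~ {sqrt_poisson_variance(lam):.4f}")
```

For very small means the approximation is visibly rougher, which is worth keeping in mind for areas with near-zero counts.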

Following the customary approach, we consider independent $N(x_i^T\beta, A)$ priors for the $\theta_i$, with $p$-dimensional auxiliary variables $x_i$ and regression parameter $\beta \in R^p$, where $m > p + 4$. The posterior of $\theta_i$ given $z_i$ is $N((1 - B)z_i + Bx_i^T\beta, (1 - B)/4)$, where $B = (1/4)/(1/4 + A)$ is the shrinkage factor.

We now turn towards empirical Bayes (EB) estimation of the $\lambda_i$. Writing $X = (x_1, \ldots, x_m)^T$, $Z = (z_1, \ldots, z_m)^T$ and $\hat{\beta} = (X^TX)^{-1}X^TZ$, it follows that marginally $\|Z - X\hat{\beta}\|^2 \sim \frac{1}{4B}\chi^2_{m-p}$. Here $X$ is an $m \times p$ matrix with rank $p$. Following Efron and Morris (1973), an EB estimator of $B$ is $\hat{B} = (m - p - 2)/(4\|Z - X\hat{\beta}\|^2)$. Thus an EB estimator of $\lambda_i$ is

For proving our technical results, we find it convenient also to define

which is also an EB estimator of $\lambda_i$ if the shrinkage factor $B$ were known. Also, for notational simplicity we write $m_0 = m - p$ hereafter.
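The estimation steps of this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function name `eb_poisson_sqrt` is hypothetical, and the final back-transformation step is an assumption (the posterior mean of $\theta_i^2$ under a normal posterior with mean $\hat{\theta}_i$ and variance $(1 - \hat{B})/4$), since the displayed formula for the EB estimator is not reproduced in the text above.

```python
import numpy as np

def eb_poisson_sqrt(y, X):
    """Sketch of EB estimation of Poisson means on the square root scale.

    y : (m,) array of counts.
    X : (m, p) full-rank design matrix with m > p + 2.
    Returns (beta_hat, B_hat, theta_hat, lam_hat).
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    m, p = X.shape
    z = np.sqrt(y)                                   # transformed data
    beta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)  # OLS estimate of beta
    resid = z - X @ beta_hat
    rss = float(resid @ resid)                       # ||Z - X beta_hat||^2
    B_hat = (m - p - 2) / (4.0 * rss)                # EB shrinkage factor
    # Shrink z towards the regression fit: (1 - B_hat) z + B_hat X beta_hat.
    theta_hat = z - B_hat * resid
    # ASSUMPTION: back-transform via the plug-in posterior mean of theta^2.
    lam_hat = theta_hat**2 + (1.0 - B_hat) / 4.0
    return beta_hat, B_hat, theta_hat, lam_hat
```

Note that $\hat{B}$ is not truncated at 1 in this sketch; whether and how to truncate is a design choice the text above does not specify.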

3 Bias of $\hat{\lambda}^{EB}$

For both bias and mean squared error (MSE) calculations for $\hat{\lambda}_i^{EB}$ we need the following lemmas.

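Let $(X, Y)$ be bivariate normal with means $\mu_1, \mu_2$, variances $\sigma_1^2, \sigma_2^2$ and covariance $\sigma_{12}$. Then

$Var(X^2) = 2\sigma_1^4 + 4\mu_1^2\sigma_1^2$,
$Var(Y^2) = 2\sigma_2^4 + 4\mu_2^2\sigma_2^2$,
$Cov(X^2, Y^2) = 2\sigma_{12}^2 + 4\mu_1\mu_2\sigma_{12}$.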
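The moment identities for squared normal variables are easy to sanity-check by simulation. The sketch below is illustrative only: the function names and the parameter values used in any call are assumptions. It compares the closed forms with sample moments of squared draws from a bivariate normal distribution.

```python
import numpy as np

def squared_normal_moments(mu1, mu2, var1, var2, cov12):
    """Closed-form Var(X^2), Var(Y^2), Cov(X^2, Y^2) for bivariate normal (X, Y)."""
    var_x2 = 2.0 * var1**2 + 4.0 * mu1**2 * var1
    var_y2 = 2.0 * var2**2 + 4.0 * mu2**2 * var2
    cov_x2y2 = 2.0 * cov12**2 + 4.0 * mu1 * mu2 * cov12
    return var_x2, var_y2, cov_x2y2

def simulated_moments(mu1, mu2, var1, var2, cov12, n=400_000, seed=0):
    """Sample versions of the same three moments (Monte Carlo)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[var1, cov12], [cov12, var2]])
    xy = rng.multivariate_normal([mu1, mu2], cov, size=n)
    c = np.cov(xy**2, rowvar=False)      # 2 x 2 covariance of (X^2, Y^2)
    return c[0, 0], c[1, 1], c[0, 1]

print(squared_normal_moments(1.0, -0.5, 1.0, 2.0, 0.6))
print(simulated_moments(1.0, -0.5, 1.0, 2.0, 0.6))
```

Here `var1`, `var2` and `cov12` play the roles of $\sigma_1^2$, $\sigma_2^2$ and $\sigma_{12}$.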

Proof. The result follows from the independence of $\hat{\beta}$ and $z_i - x_i^T\hat{\beta}$, and noting that $\hat{\beta} \sim N(\beta, (1/4 + A)(X^TX)^{-1})$ while $z_i - x_i^T\hat{\beta} \sim N(0, (1/4 + A)(1 - s_i))$.
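The two distributional facts used in this proof follow from standard least squares algebra, and can be verified without simulation. The sketch below is a hedged illustration: the function name `ols_residual_structure` is hypothetical, and $s_i$ is read here as the leverage $x_i^T(X^TX)^{-1}x_i$, which is an assumption consistent with the stated residual variance rather than a definition taken from the text.

```python
import numpy as np

def ols_residual_structure(X, A):
    """Covariance structure of beta_hat and OLS residuals when Cov(Z) = (1/4 + A) I.

    Returns (s, cov_beta, var_resid, cross_cov).
    """
    X = np.asarray(X, dtype=float)
    m, _ = X.shape
    v = 0.25 + A                          # marginal variance of each z_i
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T                 # hat matrix
    s = np.diag(H)                        # ASSUMED meaning of s_i (leverages)
    cov_beta = v * XtX_inv                # Cov(beta_hat)
    var_resid = v * (1.0 - s)             # Var(z_i - x_i' beta_hat)
    # Cov(beta_hat, residual vector) = v (X'X)^{-1} X' (I - H), identically zero,
    # which gives independence under joint normality.
    cross_cov = v * XtX_inv @ X.T @ (np.eye(m) - H)
    return s, cov_beta, var_resid, cross_cov
```

The leverages sum to $p$, so on average each residual variance is deflated by the factor $1 - p/m$.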

for all positive integers $k$. This proves (i) by an application of Basu's Theorem (Basu, 1955). Now for any positive integer $k$, and by part (i) in Lemma 3, we have,

This leads to E

is a random variable symmetric around 0, and all its odd moments are 0. Hence,

We get the second equality in the above equations using part (i) of this lemma. The proof of (iii) is complete after observing that $k$ is an odd integer.

Proof. The proof of (i) to (iv) follows by noting that $\hat{B} = (m_0 - 2)/(4\|Z - X\hat{\beta}\|^2)$ and $\|Z - X\hat{\beta}\|^2 \sim \frac{1}{4B}\chi^2_{m_0}$. To prove (v) and (vi) we also need part (ii) of Lemma 3.

Now we start with the calculation of the bias $E(\hat{\lambda}_i^{EB} - \lambda_i)$. The following theorem is proved.

and $m_0 > 2$. Then the bias of the EB estimator $\hat{\lambda}_i^{EB}$ in Eq. 2.2 for $\lambda_i = \theta_i^2$ is given by

Proof. We begin with the partition

By Lemmas 1 and 2,

The first equality in Eq. 3.6 is by independence of $\hat{\beta}$ and $((z_1 - x_1^T\hat{\beta}), \ldots, (z_m - x_m^T\hat{\beta}))$. The third equality holds by part (iii) of Lemma 3 since (z

Finally, we simplify $A_3$. By (i) and (iii) of Lemma 3 and Lemma 4,

The proof of Eq. 3.2 follows now by combining Eq. 3.3 with Eqs. 3.4-3.7.
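The bias calculation leans on moments of $\hat{B}$, which derive from inverse chi-square moments. As a hedged numerical companion (the function names, replication count and seed are illustrative assumptions), the sketch below uses $\|Z - X\hat{\beta}\|^2 \sim \frac{1}{4B}\chi^2_{m_0}$ and the standard facts $E(1/\chi^2_\nu) = 1/(\nu - 2)$ and $E(1/\chi^4_\nu) = 1/\{(\nu - 2)(\nu - 4)\}$, which give $E(\hat{B}) = B$ for $m_0 > 2$ and $E(\hat{B}^2) = B^2(m_0 - 2)/(m_0 - 4)$ for $m_0 > 4$.

```python
import numpy as np

def exact_B_hat_moments(B, m0):
    """First two moments of B_hat = (m0 - 2) / (4 * RSS), RSS ~ chi2_{m0} / (4B)."""
    mean = B                                  # requires m0 > 2
    second = B**2 * (m0 - 2) / (m0 - 4)       # requires m0 > 4
    return mean, second

def simulate_B_hat(B, m0, n_rep=200_000, seed=0):
    """Draw B_hat repeatedly from its sampling distribution."""
    rng = np.random.default_rng(seed)
    rss = rng.chisquare(m0, size=n_rep) / (4.0 * B)
    return (m0 - 2) / (4.0 * rss)

draws = simulate_B_hat(B=0.4, m0=20)
print(draws.mean(), (draws**2).mean(), exact_B_hat_moments(0.4, 20))
```

The simulated draws also show how often $\hat{B}$ exceeds 1 for small $m_0$, which is relevant when interpreting the shrinkage.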

Remark 1. With the usual assumption,

Remark 2. We can estimate the bias in Eq. 3.2 by replacing $B$ by $\hat{B}$. An unbiased estimator of the bias is

Thus, from Eq. 3.8, the EB estimator has positive bias, and the bias-

4 MSE of $\hat{\lambda}^{EB}$

The following theorem provides an exact expression for the MSE of $\hat{\lambda}^{EB}$.

The proof of Theorem 2 is given in Appendix A.

Remark 3. Theorem 2 shows that the MSE of the EB estimator $\hat{\lambda}_i^{EB}$, $E(\hat{\lambda}_i^{EB} - E(\lambda_i))^2$, is $O(1)$ for large $m$, due to the first term in Eq. 4.1.

In this section we estimate the MSE of $\hat{\lambda}_i^{EB}$ provided in Theorem 2 up to the order $O(m^{-1})$ for large $m$. We now assume that

It is easy to see that only the first two terms in Eq. 5.1 do not depend on $m$, and the remaining terms are $O(m^{-1})$. Using Lemma 4 we get

Using Eq. 5.1, we find

, Lemma 4 and the independence of $\hat{\beta}$ and $\hat{B}$, we also have (13) is estimated by

The next theorem shows that the MSE of the bias-corrected estimator $\hat{\lambda}^{CEB}$

. In other words, bias correction does not lead to any significant improvement over $MSE(\hat{\lambda}_i^{EB})$, at least when calculated up to $O(m^{-1})$.

Hence,

Due to the independence of $\hat{\beta}$ and $Z - X\hat{\beta}$, $Cov(\{x_i^T\hat{\beta}\}^2, \hat{B}) = 0$. Next, again invoking the symmetry of $Z - X\hat{\beta}$ around 0, and

Hence Eq. 5.7 reduces to

||Z−Xβ|| 2 is ancillary, and ||Z − Xβ|| 2 is a function of the complete sufficient statistic (β, Z − Xβ), by Basu's Theorem,

Hence, from Eqs. 5.7-5.10, $Cov(\hat{\lambda}_i^{EB}, \text{bias}) = O(m^{-2})$. This along with Eqs. 5.4-5.6 proves the theorem.

In this section, we apply our approach to the 2020 COVID-19 pandemic dataset, which is available at usafacts.org. This example is used mainly for illustration. We treat the figures provided as the sampled estimates. Our study shows that the coefficient of determination ($R^2$) does not increase much if we include other demographic variables, such as the population size, the number of people over age 60, and income, in the linear model regressing the number of deaths on the number of confirmed cases. This suggests that the number of confirmed cases is a more crucial variable for estimating the number of deaths from Coronavirus than the aforementioned demographic variables. We have also studied a few more county-level data sources and found that adjusted gross income (AGI) for the year 2017 is relevant for estimating the number of deaths. In our model, we have transformed the number of confirmed cases and the AGI by taking the square root. All data are aggregated at the county level.

We are interested in estimating the counts of death due to Coronavirus for all counties in Florida. Here $m = 57$ since Florida has 57 counties. From Section 2 we know that $\sqrt{y_i} \overset{ind}{\sim} N(x_i^T\beta, A + 0.25)$, and we estimate $\beta$ by the ordinary least squares method. Based on our analysis, we get $\hat{\beta} = (-0.2786, 0.0917, 0.0003)$, the respective estimates for the intercept, the number of confirmed cases and the AGI. We have summarized the results based on our model in Table 1; the estimated shrinkage factor is $\hat{B} = 0.3777$. Our model-based approach appears to pull the direct estimates towards some grand average, as one anticipates in a typical EB analysis. Figure 1 shows that the estimates are higher in south-east Florida than in the rest of the state.

In this section, we measure the performance of our model via a simulation study. The choice of the auxiliary parameters is guided by the case study of the previous section. For illustration purposes, we have considered only one covariate, the number of confirmed cases, to estimate the number of deaths due to COVID-19. These data are available for 3,142 counties of the United States. For simulation purposes, we have taken a random sample from these data without replacement for each choice of $m$. The number of small areas (counties), $m$, is set to be 25, 50, 100, 200, 500, or 1000. For each choice of $m$, we generated data from the model:

The design matrix $X$ includes a column of ones and one explanatory variable. To set the values of the parameters $\beta$ and $A$, we first fit a linear regression model for the number of deaths on the number of confirmed cases using the entire data for 3,142 counties. The estimated regression coefficient vector is (5.281570, 0.000272) and the mean squared residual is 22.75. For the simulation we set $\beta = (5.281570, 0.000272)$ and $A = 22.75 - 0.25 = 22.50$, hence the shrinkage factor is $B = (1/4)/(1/4 + A) = 0.011$. Because of the variance stabilizing transformation, the shrinkage factor does not change between counties. Now using Eq. 7.1 we generate $\lambda_i$ and $z_i$ for all $i = 1, \ldots, m$. The explanatory variable is again the number of confirmed cases, which is sampled randomly without replacement from the entire population of 3,142 counties in the United States.
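The simulation settings can be turned into a small data generator. The sketch below is an illustration under explicit assumptions, not a reproduction of Eq. 7.1, which is not shown in the text above: it assumes $\theta_i \sim N(x_i^T\beta, A)$, $z_i \mid \theta_i \sim N(\theta_i, 1/4)$ and $\lambda_i = \theta_i^2$. The names `BETA_SIM`, `A_SIM`, `shrinkage_factor` and `generate_areas` are hypothetical, and any covariate values passed in are placeholders for the sampled confirmed-case counts.

```python
import numpy as np

BETA_SIM = np.array([5.281570, 0.000272])   # values reported for the simulation
A_SIM = 22.75 - 0.25                        # prior variance, A = 22.50

def shrinkage_factor(A):
    """B = (1/4) / (1/4 + A), constant across areas."""
    return 0.25 / (0.25 + A)

def generate_areas(x, beta=BETA_SIM, A=A_SIM, seed=0):
    """Generate (X, z, lam) for covariate values x under the ASSUMED normal model."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones_like(x), x])            # intercept + one covariate
    theta = X @ beta + rng.normal(0.0, np.sqrt(A), size=x.size)
    z = theta + rng.normal(0.0, 0.5, size=x.size)        # sd 0.5 gives variance 1/4
    lam = theta**2                                       # back-transformed mean
    return X, z, lam

print(round(shrinkage_factor(A_SIM), 4))
```

With such a small $B$, the EB estimates stay close to the direct estimates, so the simulation mainly probes the accuracy of the bias and MSE formulas rather than the gain from shrinkage.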

Here we compare the true RMSE and the estimated RMSE of $\hat{\lambda}_i^{EB}$. We examine our findings in Theorems 2, 3 and 4 based on six different settings for $m$, namely 25, 50, 100, 200, 500 and 1000; only one dataset is generated for each $m$. We have estimated the

Figure 2 substantiates that the approximations given in Theorems 3 and 4 are fairly close to the true RMSE. In addition, they also point out one particular small area where the MSE is significantly higher than in the rest of the small areas.
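An area-level RMSE can also be approximated by brute-force replication, which gives an independent check on analytical expressions. The sketch below is a Monte Carlo stand-in, not the exact formula of Theorem 2: the function names are hypothetical, the data model ($\theta_i \sim N(x_i^T\beta, A)$, $z_i \mid \theta_i \sim N(\theta_i, 1/4)$) and the back-transformation $\hat{\theta}_i^2 + (1 - \hat{B})/4$ are assumptions, and the covariate grid in any call is a placeholder.

```python
import numpy as np

def eb_estimate(z, X):
    """EB estimate of lambda_i = theta_i^2 from transformed data z (sketch)."""
    m, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta_hat
    B_hat = (m - p - 2) / (4.0 * float(resid @ resid))
    theta_hat = z - B_hat * resid
    return theta_hat**2 + (1.0 - B_hat) / 4.0   # ASSUMED back-transformation

def monte_carlo_rmse(X, beta, A, n_rep=2000, seed=0):
    """Per-area RMSE of the EB estimator, averaged over simulated replications."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    sq_err = np.zeros(m)
    for _ in range(n_rep):
        theta = X @ beta + rng.normal(0.0, np.sqrt(A), size=m)
        z = theta + rng.normal(0.0, 0.5, size=m)
        sq_err += (eb_estimate(z, X) - theta**2) ** 2
    return np.sqrt(sq_err / n_rep)

# Placeholder covariates standing in for sampled confirmed-case counts.
x = np.linspace(100.0, 50000.0, 25)
X = np.column_stack([np.ones_like(x), x])
rmse = monte_carlo_rmse(X, np.array([5.281570, 0.000272]), 22.5, n_rep=500)
print(rmse.round(2))
```

An area with an unusually large covariate value has both high leverage and a large $\theta_i$, which is one plausible way for a single area to stand out in RMSE.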

The paper introduces a square root transformation of Poisson count data, and attains approximately both normality of the transformed data and variance stabilization. In this way, we obtain explicit estimates of bias and MSE for Poisson means. Based on the simulation, our estimates closely resemble the truth. The data analysis indicates that the estimates are higher in south-east Florida when the model is appropriate. There are many potential extensions. One that immediately comes to mind is consideration of unit level models with a corresponding square root transformation. Gonçalves and Ghosh (2021) have addressed this problem using a pure hierarchical Bayesian framework, but an empirical Bayes approach with all its theoretical properties should also be a topic of future investigation. Even under the present framework, one may add a spatial component using something like a CAR model (see, for example, Ghosh et al., 1999). A final interesting problem is to consider an overdispersed Poisson model, i.e. a negative binomial model for count data, with a variable transformation as in Yu (2009), which also leads to homoscedasticity.

Funding. The third author's research was partially supported by JSPS KAKENHI grant number 18K12758.

Compliance with Ethical Standards. The authors declare that there are no conflicts of interest relevant to the content of this article.

Proof. In this section we calculate the MSE of the EB estimator

Each term on the right side of Eq. 5.11 will be computed separately. Since

Next we compute the second term on the right side of Eq. 5.11. By Lemma 2 and Eq. 3.4,

Now we evaluate the last expression on the right side of Eq. 5.11.

The expressions $A_1$, $A_2$ and $A_3$ in Eq. 3.5 are functions of the residuals $((z_1 - x_1^T\hat{\beta}), \ldots, (z_m - x_m^T\hat{\beta}))$, and the expressions $B_1$ and $B_2$ in Eq. 5.18 are functions of $\hat{\beta}$. Therefore, $(A_1, A_2, A_3)$ is independent of $(B_1, B_2)$ and their covariances are 0. The expectation of $B_3$ is 0 since (

Hence, again using part (v) of Lemma 4, Eqs. 3.5 and 5.18, we have

Hence, from Eqs. 3.4 and 3.7,

Next we compute the remaining third term on the right side of Eq. 5.11.

(5.19)

In the above calculation, the cross terms $E(A_1A_2)$ and $E(A_2A_3)$ in Eq. 5.19 vanish by part (iii) of Lemma 3. Again, by part (ii) of Lemma 4, we have,

. (5.20)

Again, applying Lemmas 3 and 4, we obtain,

−4m0 (m0 − 2) 2 + 4 2(m0 + 2) (m0 − 2) 2 ) = 3(1 − si) 2 2m 2 0 − 9m0 + 6 4m0(m0 + 2)(m0 − 4) B 2 − B m0 + 2 + 1 2m0

. (5.21) Finally, once again by Lemmas 3 and 4, and recalling $\|Z - X\hat{\beta}\|^2 = (m_0 - 2)/(4\hat{B})$, we get (5.19),

Theorem 2 follows from Eqs. 5.11, 5.12, 5.17, 5.19 and 5.23.

On statistics independent of a complete sufficient statistic

Stein's estimation rule and its competitors: an empirical Bayes approach

Estimates of income for small places: an application of James-Stein procedures to census data

Hierarchical Bayes GLMs for the analysis of spatial data: an application to disease mapping

Benchmarked empirical Bayes methods in multiplicative area-level models with risk evaluation

Unit level model for small area estimation with count data under square root transformation

Arc-sin transformation for binomial sample proportions in small area estimation

Mean-squared error estimation in transformed Fay-Herriot models

Transforming response values in small area prediction

Variance stabilizing transformations of Poisson, binomial and negative binomial distributions

Acknowledgements. The authors are grateful to the editor and anonymous reviewer(s) for their constructive comments and suggestions which greatly improved an earlier version of this article.