title: Semiparametric modelling of two-component mixtures with stochastic dominance
authors: Wu, Jingjing; Abedin, Tasnima; Zhao, Qiang
date: 2022-05-24
journal: Ann Inst Stat Math
DOI: 10.1007/s10463-022-00835-5

In this work, we studied a two-component mixture model with stochastic dominance constraint, a model arising naturally from many genetic studies. To model the stochastic dominance, we proposed a semiparametric modelling of the log of density ratio. More specifically, when the log of the ratio of two component densities is in a linear regression form, the stochastic dominance is immediately satisfied. For the resulting semiparametric mixture model, we proposed two estimators, maximum empirical likelihood estimator (MELE) and minimum Hellinger distance estimator (MHDE), and investigated their asymptotic properties such as consistency and normality. In addition, to test the validity of the proposed semiparametric model, we developed Kolmogorov-Smirnov type tests based on the two estimators. The finite-sample performance, in terms of both efficiency and robustness, of the two estimators and the tests were examined and compared via both thorough Monte Carlo simulation studies and real data analysis. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s10463-022-00835-5.

We consider the two-sample model introduced by Abedin (2018). Specifically, suppose there is a sample from a two-component mixture population H = (1 − λ)F + λG, where the mixing proportion λ is strictly between 0 and 1, and F, G and H are cumulative distribution functions (c.d.f.) such that F ≠ G and F and G satisfy the stochastic dominance constraint F ≥ G. In addition, we have a separate sample identified as coming from the first component F.
As a result, the data and model are formulated as

X₁, …, X_m i.i.d. ∼ f(x),  Y₁, …, Y_n i.i.d. ∼ h(y) = (1 − λ)f(y) + λg(y),  (1)

where f, g and h are the corresponding probability density functions (p.d.f.) of F, G and H, respectively. Here f, g and λ are all unknown. The problem of our interest is to make inferences for λ, treating f and g as nuisance parameters, and to estimate the likelihood of a new observation coming from a particular component. A motivating example that gives rise to model (1) is the problem of identifying differentially expressed genes in case-control studies of genetic data (e.g., subjects who contract a disease vs. those who do not), such as in Chen and Wu (2013), Kharchenko et al. (2014), Lu et al. (2015) and Ficklin et al. (2017), among many others. To solve this problem, a proposed test is frequently applied repeatedly to each single gene. The distribution of those thousands of test statistic values is thus a two-component mixture (1 − λ)f + λg, where f is the p.d.f. of the statistic under the null hypothesis that a particular gene is not differentially expressed, g is that under the alternative hypothesis that a particular gene is differentially expressed, and λ is the proportion of differentially expressed genes. Usually f is much easier to derive theoretically than g. When f is unknown in practice, in many studies pathologists or experts can confidently identify some genes that are not differentially expressed, so that we have a sample (the test statistic values of those genes) from f. The latter case is exactly model (1). The dominance constraint F ≥ G is very intuitive in many cases where the statistic values of marker genes are likely larger (or smaller) than those of non-marker genes (e.g., Student's t, ANOVA test statistic). In this motivating example one may argue that the genes, and thus the values of a particular statistic, are not i.i.d. as assumed in model (1).
Nevertheless, the distribution of the statistic over all genes is assumed nonparametrically unknown, which provides enough flexibility to weaken, if not remove, the effect of dependence. By Bayes' rule, the probability of a gene with test statistic value y being a marker gene is given by

p(y) := P(marker gene | y) = λg(y) / [(1 − λ)f(y) + λg(y)].  (2)

Once λ, f and g are estimated, one can estimate according to (2) the probability that a particular gene is differentially expressed. Besides the motivating example, model (1) can also be used to model many other real data structures. For more examples, readers are referred to Abedin (2018) and Wu and Abedin (2021) which, to the best of our knowledge, are the only works in the literature on model (1). In these two works, the authors proposed and studied two estimators. The first is based on c.d.f. estimation with use of the dominance inequality, while the second is a maximum likelihood estimator (MLE) based on multinomial approximation. Another work on a model closely related to model (1) is given in Smith and Vounatsou (1997). However, their model does not assume the stochastic dominance constraint but instead makes a generally stronger assumption on the p.d.f.s f and g. This paper is organized in the following way. In Sect. 2, we introduce a semiparametric mixture model in order to accommodate the stochastic dominance constraint. A maximum empirical likelihood estimator (MELE) and a minimum Hellinger distance estimator (MHDE) of the unknown parameters are proposed in Sects. 3 and 4, respectively. Their asymptotic properties such as consistency and asymptotic normality are also presented in these two sections. Section 5 is devoted to testing the validity of the proposed semiparametric model with use of the MELE and MHDE. The finite-sample performance of the two proposed estimators and the goodness-of-fit tests are assessed in Sect.
6 via thorough Monte Carlo simulation studies, while the implementation of the proposed methods is demonstrated in Sect. 7 through two real data examples. Finally, the concluding remarks are presented in Sect. 8. Some conditions for the theoretical results are deferred to the Appendix, while the derivations and proofs of the theorems and lemmas are given in a separate supplementary document to save space.

Abedin (2018) and Wu and Abedin (2021) proposed and studied two consistent estimators for model (1). However, due to the non-identifiability of the model in general, it is difficult to obtain an estimator with good asymptotic properties such as asymptotic normality. Therefore, we introduce in what follows a semiparametric model which will be proved identifiable and which ensures the stochastic dominance. Let Z denote a binary response variable and Y the associated covariate. Then the logistic regression model is given by

P(Z = 1 | Y = y) = exp(α* + β⊤r(y)) / [1 + exp(α* + β⊤r(y))],

where r(y) = (r₁(y), …, r_p(y))⊤ is a given p × 1 vector of functions of y, α* is the intercept parameter and β = (β₁, …, β_p)⊤ is the p × 1 coefficient parameter vector. In case-control studies, data are collected retrospectively. For example, a random sample of subjects with disease Z = 1 ('case') and a separate random sample of subjects without disease Z = 0 ('control') are selected, with Y observed for each subject. Let λ = P(Z = 1) = 1 − P(Z = 0). Let f(y) and g(y) denote the conditional p.d.f.s of Y given Z = 0 and Z = 1 respectively; then it follows from Bayes' rule that

g(y) = exp(α + β⊤r(y)) f(y),  (3)

with α = α* + log{(1 − λ)/λ}. With (3), model (1) is reduced to the semiparametric mixture model

h(y) = (1 − λ)f(y) + λ exp(α + β⊤r(y)) f(y),  (4)

with f ∈ F and the parameter vector of interest θ = (λ, α, β⊤)⊤ belonging to the parameter space Θ.  (5)

The relationship (3) between the two p.d.f.s was first proposed by Anderson (1972). It essentially assumes that the log-likelihood ratio of the two p.d.f.s is linear in the observations.
With r(x) = x or r(x) = (x, x²)⊤, it has wide applications in logistic discriminant analysis (Anderson, 1972, 1979) and case-control studies (Prentice and Pyke, 1979; Breslow and Day, 1980). For r(x) = x, (3) encompasses many common distributions, including two exponential distributions with different means and two normal distributions with common variance but different means. Model (3) with r(x) = (x, x²)⊤ also coincides with the exponential family of densities considered in Efron and Tibshirani (1996). Moreover, model (3) can be viewed as a biased sampling model with the 'tilt' weight function exp(α + β⊤r(x)) depending on the unknown parameters α and β. Note that the test of equality of f and g can be regarded as a special case of model (4) with β = 0. Qin and Zhang (1997) discussed a goodness-of-fit test for logistic regression based on case-control data, where the first sample comes from the control group f and, independently, the second sample comes from the case group g. They proposed a Kolmogorov-Smirnov type statistic to test the validity of (3) with r(y) = y. When data from both the mixture and the two individual components satisfying (3) are available, Qin (1999) developed an empirical likelihood ratio-based statistic for constructing confidence intervals of the mixing proportion. For the same model and data structure, Zhang (2002) proposed an EM algorithm to calculate the MELE, while Zhang (2006) proposed a score statistic to test the mixing proportion. Chen and Wu (2013) employed (3) to model differentially expressed genes of acute lymphoblastic leukemia patients and acute myeloid leukemia patients. For model (4) with r(y) = y, if β > 0 then we can easily check that p(y) in (2), the probability of y being from g, is a monotonically increasing function. Further, we prove in the next theorem that if β > 0, then the stochastic dominance constraint F ≥ G is implied by (3).
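As a quick numerical illustration of the linear-tilt relation (a hypothetical sketch, not part of the original analysis): for f = N(0, 1) and g = N(μ, 1), the log density ratio is exactly α + βy with α = −μ²/2 and β = μ, which the following snippet verifies.

```python
import math

# For f = N(0,1) and g = N(mu,1), log g(y) - log f(y) = mu*y - mu**2/2,
# i.e. alpha = -mu**2/2 and beta = mu in relation (3) with r(y) = y.
def norm_pdf(y, mu):
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)

mu = 1.0
for y in (-2.0, 0.0, 0.7, 3.0):
    lhs = math.log(norm_pdf(y, mu)) - math.log(norm_pdf(y, 0.0))
    rhs = -mu**2 / 2 + mu * y          # alpha + beta * y
    assert abs(lhs - rhs) < 1e-12
```

Since β = μ > 0 here, this pair also satisfies the dominance constraint F ≥ G discussed above.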
We call (1, r(y))⊤ linearly independent on the support, say S, of f if (1, r(y))⊤, as a vector of functions of y, is linearly independent over S.

Theorem 1 If (1, r(y))⊤ is linearly independent on the support of f, then model (4) with parameter space (5) is identifiable. Specifically, (4) is identifiable when r(y) = y; if further β > 0, then F ≥ G.

Even though Theorem 1 tells us that the condition (3) is stronger than the original stochastic dominance constraint, the resulting semiparametric mixture model (4) is identifiable and has a better interpretation than the nonparametric mixture model (1). In addition, the estimation of (4) may possess better asymptotic properties than that of (1). One may also consider higher-order polynomials for r(y) and find sufficient conditions for F ≥ G. For model simplicity, we focus on model (4) with r(x) = x and β > 0 (to ensure the dominance F ≥ G) throughout this paper. We first propose two estimators for this model. Denote by (T₁, …, T_N) the pooled sample (X₁, …, X_m, Y₁, …, Y_n) with N = m + n, and let p_i = dF(T_i). Consequently, the empirical likelihood function of (4) with r(y) = y and β > 0 is

L(θ, F) = ∏_{i=1}^{N} p_i ∏_{j=1}^{n} [1 − λ + λe^{α+βY_j}],

subject to the constraints β > 0, 0 < λ < 1, p_i ≥ 0, ∑_{i=1}^{m+n} p_i = 1, and ∑_{i=1}^{m+n} p_i e^{α+βT_i} = 1. The constraint β > 0 will not limit the use of model (4) and the proposed estimation methods, as we can always switch F with G if β < 0 in model (4). Though in this work we assume both F and G are continuous populations, we can see that the above likelihood is also valid for discrete populations. To find the MELE, we use Lagrange multipliers and maximize

∑_{i=1}^{N} log p_i + ∑_{j=1}^{n} log[1 − λ + λe^{α+βY_j}] − η₁(∑_{i=1}^{N} p_i − 1) − η₂ ∑_{i=1}^{N} p_i (e^{α+βT_i} − 1).

By taking partial derivatives we obtain (details in the supplementary document)

p_i = N^{-1} [1 + ρ_N λ(e^{α+βT_i} − 1)]^{-1},  (6)

where ρ_N = n/(m + n) with N = m + n. Therefore, ignoring a constant, the empirical log-likelihood function is

l(θ) = −∑_{i=1}^{N} log[1 + ρ_N λ(e^{α+βT_i} − 1)] + ∑_{j=1}^{n} log[1 − λ + λe^{α+βY_j}].  (7)

Let θ̂_MELE = (λ̂_MELE, α̂_MELE, β̂_MELE)⊤ denote the MELE of θ, i.e., the maximizer of the log-likelihood function l in (7). In our numerical studies, we used the function optim in R to find θ̂_MELE.
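The numerical maximization of (7) can be sketched as follows. This is an illustrative sketch only, assuming the profiled log-likelihood form displayed in (7); the sample sizes, starting values and optimizer choice are arbitrary, and scipy's Nelder-Mead stands in for R's optim.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_el(params, x, y):
    """Negative profiled empirical log-likelihood l(lam, a, b), cf. (7);
    t is the pooled sample and rho = n/(m+n)."""
    lam, a, b = params
    if not (0.0 < lam < 1.0) or b <= 0.0:     # enforce 0 < lam < 1, b > 0
        return np.inf
    t = np.concatenate([x, y])
    rho = len(y) / len(t)
    w = 1.0 - lam + lam * np.exp(a + b * y)            # mixture weight at Y's
    pooled = 1.0 + rho * lam * (np.exp(a + b * t) - 1.0)
    return np.sum(np.log(pooled)) - np.sum(np.log(w))

# simulated data: f = N(0,1); mixture with lam = 0.5, second component N(1,1)
rng = np.random.default_rng(7)
m = n = 300
x = rng.normal(0.0, 1.0, m)
z = rng.random(n) < 0.5
y = np.where(z, rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n))

fit = minimize(neg_log_el, x0=[0.4, -0.3, 0.8], args=(x, y),
               method="Nelder-Mead")
lam_hat, a_hat, b_hat = fit.x
```

The returned point stays inside the constraint region because the objective is infinite outside it.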
Then the MELE of p(y) in (2) is

p̂_MELE(y) = λ̂_MELE e^{α̂_MELE+β̂_MELE y} / [1 − λ̂_MELE + λ̂_MELE e^{α̂_MELE+β̂_MELE y}].  (8)

We now examine the asymptotic properties of the MELE θ̂_MELE. Define S_N as in (9), built from the second-order partial derivatives of l given in (7), and let S be its limiting matrix defined in (10), with entries S_ij expressed in terms of

w₁(y) = 1 − λ + λe^{α+βy} and w₂(y) = 1 − ρλ + ρλe^{α+βy}.  (11)

Since f ∈ F and r(x) = x, implying ∫ e^y f(y)dy < ∞, it is easy to show that all the S_ij's defined above are finite.

Lemma 1 Assume ρ_N → ρ as N → ∞, with 0 < ρ < 1. Then S_N →P (1 − ρ)S as N → ∞, where S_N and S are defined in (9) and (10) respectively and →P denotes convergence in probability.

The condition ρ_N → ρ with 0 < ρ < 1, as N → ∞, in Lemma 1 requires that the sample sizes m and n converge to infinity at the same order. With the results in Lemma 1, the following theorem gives the asymptotic normality of the MELE θ̂_MELE. Let V denote the matrix defined in (12), with entries expressed in terms of w₁ and w₂ defined in (11).

Theorem 2 Assume ρ_N → ρ as N → ∞ with 0 < ρ < 1 and S defined in (10) is invertible. Then under model (4) with parameter space (5) and some regularity conditions (for MELE in general; see Qin and Lawless, 1994),

√N (θ̂_MELE − θ) →D N(0, Σ),  Σ = [ρ(1 − ρ)]^{-1} S^{-1} V S^{-1},

with S and V defined in (10) and (12) respectively, and →D denotes convergence in distribution. In addition, V and further Σ are positive definite.

In Theorem 2, the assumption of S being invertible is generally true except for some special cases such as λ = 0, or α = β = 0, or f being a singleton. When 0 < λ < 1, β ≠ 0 and f is continuous (or under the even weaker condition that the support of f contains at least two points), we would expect S to be invertible. With the results in Theorem 2, one can easily make MELE-based inferences, such as the Wald test, about the parameter θ when the unknown θ in Σ is replaced with some consistent estimator such as the MELE. Though the MELE is efficient, it is generally non-robust against outliers and model misspecifications. As a robust alternative, we propose in this section an MHDE for the unknown parameters in model (4). The Hellinger distance between two density functions f₁ and f₂ is defined as ‖f₁^{1/2} − f₂^{1/2}‖, where ‖·‖ denotes the L₂-norm, and the MHDE of θ is defined as the minimizer

θ̂ = arg min_{t∈Θ} ‖h_t^{1/2} − ĥ^{1/2}‖,  (13)

where ĥ is an appropriate nonparametric estimator of h.
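Given any fitted values, the plug-in posterior probability (8) is a one-line computation; the following sketch uses hypothetical fitted values (λ̂, α̂, β̂) = (0.5, −0.5, 1), which are not estimates from the paper.

```python
import math

def p_hat(y, lam, a, b):
    """Plug-in estimate of p(y) in (2) under model (4) with r(y) = y:
    p(y) = lam*exp(a + b*y) / (1 - lam + lam*exp(a + b*y))."""
    e = lam * math.exp(a + b * y)
    return e / (1.0 - lam + e)

# hypothetical fitted values; y = 0.5 is the equal-density point for
# N(0,1) vs N(1,1) with lam = 0.5, so p_hat there is exactly 0.5
p = p_hat(0.5, 0.5, -0.5, 1.0)
```

The same formula, with MHDE fitted values, gives the estimate (18) introduced in the next section.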
The MHDE was first introduced by Beran (1977) for fully parametric models. Beran (1977) showed that the MHDE for parametric models has both full efficiency and good robustness properties. Lindsay (1994) outlined the comparison between the MHDE and the MLE in terms of robustness and efficiency and showed that the MHDE and the MLE are members of a larger class of efficient estimators with various second-order efficiency properties. However, the literature on the MHDE for mixture models is not abundant. Lu et al. (2003) considered the MHDE for mixtures of Poisson regression models. The MHDE of mixture complexity for finite mixture models was investigated by Woo and Sriram (2006, 2007). Recently, the MHDE has been extended from parametric models to semiparametric models. Wu et al. (2010) proposed an MHDE for two-sample case-control data under model (3) and investigated the asymptotic properties and robustness of the proposed estimator. Xiang et al. (2014) proposed a minimum profile Hellinger distance estimator (MPHDE) for the two-component semiparametric mixture model studied by Bordes et al. (2006), where one component is known and the other is an unknown symmetric function with unknown location parameter. Wu et al. (2017) and Wu and Zhou (2018) proposed algorithms for calculating the MPHDE and the MHDE, respectively, for the two-component semiparametric location-shifted mixture model. Inspired by these works, we propose to use the MHDE to estimate the parameters in (4). In model (4), even though α and β can possibly take any value on the real line, we can essentially use intervals that are large enough to cover their true values. So for practical purposes, we can assume that θ ∈ Θ with Θ a compact subset of ℝ³.
To give the MHDE for model (4), note that the MHDE defined in (13) requires the parametric model h_t, which here involves the unknown nuisance parameter f. Intuitively, we can use an appropriate nonparametric density estimator, say f_m, based on X₁, …, X_m to replace f and apply the plug-in rule to estimate h_t by

ĥ_t(y) = [1 − t₁ + t₁ e^{t₂+t₃y}] f_m(y),  (14)

for t = (t₁, t₂, t₃)⊤. With another appropriate nonparametric density estimator, say h_n, of h based on Y₁, …, Y_n, we define the MHDE of θ = (λ, α, β)⊤ as

θ̂_MHDE = arg min_{t∈Θ} ‖ĥ_t^{1/2} − h_n^{1/2}‖.  (15)

That is, θ̂_MHDE is the minimizer t of the Hellinger distance between the estimated parametric model ĥ_t and the nonparametric density estimator h_n. In this work, we use the kernel density estimators

f_m(x) = (m b_m)^{-1} ∑_{i=1}^{m} K₀((x − X_i)/b_m),  (16)
h_n(y) = (n b_n)^{-1} ∑_{j=1}^{n} K₁((y − Y_j)/b_n),  (17)

where K₀ and K₁ are kernel p.d.f.s and the bandwidths b_m and b_n are positive sequences such that b_m → 0 as m → ∞ and b_n → 0 as n → ∞. The MHDE of p(y) in (2) is given by

p̂_MHDE(y) = λ̂_MHDE e^{α̂_MHDE+β̂_MHDE y} / [1 − λ̂_MHDE + λ̂_MHDE e^{α̂_MHDE+β̂_MHDE y}].  (18)

Note that in (15) we do not impose any restriction on ĥ_t to make it a density function. The reason behind this is that a t ∈ Θ that makes ĥ_t not a density can still make h_t a density. The true parameter value θ may not make ĥ_θ a density, but it is not reasonable to exclude θ itself as a candidate for the estimate θ̂_MHDE. As an explicit expression of θ̂_MHDE does not exist, one needs to use iterative methods such as Newton-Raphson to calculate it numerically. Karunamuni and Wu (2011) have shown for parametric models that, with an appropriate initial value, even a one-step iteration works well and gives a quite accurate approximation of the MHDE. We now examine the asymptotic properties of the MHDE θ̂_MHDE given in (15). We first establish, in Theorem 3, the existence and continuity of the minimizer defining (15). With the results given in Theorem 3, the consistency of our proposed MHDE can be proved and is presented in the following theorem.

Theorem 4 Let m, n → ∞ as N → ∞. Suppose that (D1) holds for any θ ∈ Θ and the bandwidths b_m and b_n in (16) and (17) satisfy b_m → 0 and b_n → 0 as N → ∞. Then θ̂_MHDE is a consistent estimator of θ, where f_m, h_n and ĥ_t are defined in (16), (17) and (14), respectively.

We derive in the next theorem a stochastic expansion of the bias term θ̂_MHDE − θ.

Theorem 5 Suppose that Σ(θ) defined in (19) is invertible and that (C2) and the assumptions in Theorem 4 hold. Then θ̂_MHDE − θ admits an asymptotically linear expansion with a remainder factor R_N, where θ̂_MHDE is defined by (15) and R_N is a 3 × 3 matrix with elements tending to zero in probability as N → ∞.
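A minimal sketch of how (14)-(17) can be combined to compute the MHDE numerically. The grid-based approximation of the L₂-norm, the bandwidth constants, the starting values and the optimizer are all illustrative choices, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def kde(data, grid, bw):
    """Gaussian kernel density estimate on a grid, as in (16)-(17)."""
    u = (grid[:, None] - data[None, :]) / bw
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * bw * np.sqrt(2 * np.pi))

def hellinger_objective(params, grid, f_m, h_n, dx):
    """Squared L2 distance between sqrt(h_t-hat) and sqrt(h_n); h_t-hat as
    in (14) and is NOT forced to be a density (see the remark after (18))."""
    lam, a, b = params
    if not (0.0 < lam < 1.0):
        return np.inf
    h_t = (1.0 - lam + lam * np.exp(a + b * grid)) * f_m
    return np.sum((np.sqrt(h_t) - np.sqrt(h_n)) ** 2) * dx

rng = np.random.default_rng(11)
x = rng.normal(0.0, 1.0, 200)                      # sample from f
y = np.where(rng.random(200) < 0.5,
             rng.normal(1.0, 1.0, 200), rng.normal(0.0, 1.0, 200))  # from h

grid = np.linspace(-4.0, 5.0, 400)
dx = grid[1] - grid[0]
bw_m = 1.06 * x.std() * len(x) ** (-0.2)           # Silverman-type, r = 1/5
bw_n = 1.06 * y.std() * len(y) ** (-0.2)
f_m, h_n = kde(x, grid, bw_m), kde(y, grid, bw_n)

fit = minimize(hellinger_objective, x0=[0.4, -0.3, 0.8],
               args=(grid, f_m, h_n, dx), method="Nelder-Mead")
```

In practice a Newton-Raphson or one-step iteration from a good initial value, as mentioned above, can replace the simplex search.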
From this result, we obtain immediately the asymptotic distribution of the MHDE, stated in Theorem 6. In Theorems 5 and 6, the assumption of Σ(θ) being invertible is generally true except for some special cases such as λ = 0, or α = β = 0, or f being a singleton. Since the MELE is a likelihood-based method, we expect the MELE to be asymptotically more efficient than the MHDE. Looking at the asymptotic covariance matrices given in Theorems 2 and 6 for the MELE and MHDE respectively, we note that, roughly speaking, |S_ij| is smaller than |Σ_ij| while |Σ̄_ij| is much bigger than both, due to the fact that the denominator w₁ or w₂ can control the exponential terms in the integrands. As a result, the asymptotic variances of the MELEs of λ, α and β are smaller than those of the corresponding MHDEs. To see the asymptotic relative efficiency of the MHDE with respect to the MELE, we take the normal mixture (1 − λ)N(0, 1) + λN(μ, 1), a model examined also in our later simulation studies, as an example, and Table 1 below presents the asymptotic covariance matrices of the MELE and MHDE for ρ = 0.5 and various parameter settings μ = .5, 1, 2 (i.e., (α, β) = (−.125, .5), (−.5, 1), (−2, 2), respectively) and λ = .05, .20, .50, .80, .95. From Table 1 we see that, as expected, the elements in the covariance matrices of the MELE are almost always smaller, in absolute value, than those of the MHDE. The difference in the asymptotic variances of the MELE and MHDE of α or β is very small for the smaller μ (μ = .5) and smaller λ values. The relative efficiency of the MHDE with respect to the MELE deteriorates when μ increases or, especially, when λ increases. We also observe from Table 1 that the asymptotic variances of both the MELE and MHDE of α and β decrease when λ increases, which is expected by the fact that the information about α and β is contained in the second component only and the amount of such information increases when the mixing proportion λ of the second component increases.
Comparatively, when λ increases, the asymptotic variances of both the MELE and MHDE of λ increase first, reach their maxima around λ = 0.5 and then decrease. This phenomenon can be explained by the fact that the estimation of λ is associated with a Bernoulli random variable with parameter λ, and the variance of this random variable, λ(1 − λ), as a function of λ has the same increasing-decreasing trend. In this section, we discuss the validity of the semiparametric mixture model (4), or equivalently the relationship (3) between f and g, with r(x) = x. Several goodness-of-fit test statistics for testing the relationship (3) in case-control studies are available in the literature; see, for example, Qin and Zhang (1997), Zhang (1999, 2001, 2006), and Deng et al. (2009). For our semiparametric mixture model (4), we construct Kolmogorov-Smirnov (K-S) type statistics based on the proposed MELE and MHDE. The idea of the K-S test statistic is to use the discrepancy between two c.d.f. estimates, one with the model assumption and the other without, to assess the validity of a model. For model (4) with r(x) = x, we can use the empirical c.d.f. based on the first sample X_i's as the first estimate and the MELE or MHDE based on both samples X_i's and Y_i's exploiting (4) as the second. We first look at the special case of testing β = 0 in model (4). When β = 0, we also have α = 0, which implies the equality of the two components F and G and further the equality of F and H. The two-sample K-S statistic for testing the equality of F and H is

KS = √(mn/N) sup_x |F̂(x) − Ĥ(x)|,  (21)

where N = m + n, and F̂, Ĥ and F̂₀ denote the empirical c.d.f.s based on the X_i's, the Y_i's and the pooled sample (T₁, …, T_N), respectively. Note that F̂ and Ĥ are the nonparametric MLEs of F and H respectively without the assumption of F = H, whereas F̂₀ is the nonparametric MLE of F with the assumption of F = H.
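The special-case statistic above can be computed directly from the two empirical c.d.f.s; a minimal sketch (the √(mn/N) scaling follows the display above):

```python
import numpy as np

def two_sample_ks(x, y):
    """sqrt(mn/N) * sup_t |F-hat(t) - H-hat(t)| for testing F = H."""
    m, n = len(x), len(y)
    pooled = np.sort(np.concatenate([x, y]))
    # empirical c.d.f.s of each sample evaluated at all pooled points
    F = np.searchsorted(np.sort(x), pooled, side="right") / m
    H = np.searchsorted(np.sort(y), pooled, side="right") / n
    return np.sqrt(m * n / (m + n)) * np.max(np.abs(F - H))

stat = two_sample_ks(np.array([0.1, 0.4, 0.9]), np.array([0.2, 0.5, 0.8]))
# here sup|F-hat - H-hat| = 1/3, so stat = sqrt(9/6) * 1/3
```

Evaluating only at the pooled order statistics suffices because both step functions jump only there.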
Now consider the general case of testing the validity of model (4). Exploiting the fitted probabilities p̂_i in (6), we estimate F under model (4) by

F̃(x) = ∑_{i=1}^{N} p̂_i 1(T_i ≤ x),  (23)

and form the K-S statistic

KS = √(mn/N) sup_x |F̂(x) − F̃(x)|.  (22)

When the MELE θ̂_MELE is used to compute the quantities in (6) and (23), we denote the resulting estimates as F̃_MELE and KS_MELE. A similar statistic was considered by Qin and Zhang (1997), but they used it for case-control data instead of our more complicated mixture model (4). Intuitively, we can also use the MHDE θ̂_MHDE given in (15) for θ̂, and we then denote the resulting F̃ in (23) and KS in (22) as F̃_MHDE and KS_MHDE, respectively. In our later numerical studies, we use a bootstrap procedure to find the approximate distributions and critical values of KS_MELE and KS_MHDE. To generate the bootstrap data, we randomly select independent samples X*_i's from dF̃(x) and Y*_i's from (1 − λ̂ + λ̂e^{α̂+β̂x})dF̃(x), where θ̂ = (λ̂, α̂, β̂)⊤ and F̃ are either the MELEs θ̂_MELE and F̃_MELE or the MHDEs θ̂_MHDE and F̃_MHDE, respectively. Note that both the X*_i's and the Y*_i's are selected from the pooled data (T₁, …, T_N) but with different probability distribution functions. This means that some of the selected X*_i's could be values in the original second sample Y_i's and some of the selected Y*_i's could be values in the original first sample X_i's. Let (T*₁, …, T*_N) denote the combined bootstrap sample and θ̂* = (λ̂*, α̂*, β̂*)⊤ be either the MELE or the MHDE based on the bootstrap samples X*_i's and Y*_i's. Then we can calculate the empirical c.d.f. F̂* based on the X*_i's, the quantities in (6) and the function in (23) based on the bootstrap data, and thereby the bootstrap version of the K-S statistic.

In this section, we examine through Monte Carlo simulation studies the finite-sample performance of the proposed MELE and MHDE of the parameters in model (4) and the proposed K-S tests of this model. In our simulation study, we consider the mixtures of normals, Poissons or uniforms given in Table 2. We can easily check that all five models satisfy the stochastic dominance condition; however, only models M1-M4 satisfy relationship (4). Model M5 does not satisfy (4) since the two components have different supports.
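The bootstrap resampling scheme described above (draw the X*'s from dF̃ and the Y*'s from the tilted, renormalized version of dF̃, both supported on the pooled points) can be sketched as follows; the fitted weights p and parameter values here are hypothetical placeholders, not output of the actual estimators.

```python
import numpy as np

def bootstrap_samples(t, p, lam, a, b, m, n, rng):
    """Draw X* from dF-tilde (weights p on the pooled points t) and
    Y* from (1 - lam + lam*exp(a + b*t)) dF-tilde, renormalized."""
    q = p * (1.0 - lam + lam * np.exp(a + b * t))
    q = q / q.sum()                      # tilted weights for the Y* draws
    x_star = rng.choice(t, size=m, replace=True, p=p)
    y_star = rng.choice(t, size=n, replace=True, p=q)
    return x_star, y_star

rng = np.random.default_rng(3)
t = np.array([-1.0, 0.0, 0.5, 1.2, 2.0])   # toy pooled sample
p = np.full(5, 0.2)                        # hypothetical fitted F-tilde weights
xs, ys = bootstrap_samples(t, p, 0.5, -0.5, 1.0, 100, 100, rng)
```

Both bootstrap samples live on the pooled points, which is exactly the mixing of the two original samples noted above.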
Even though the focus of this paper is on continuous mixture models, we also want to check the performance of the proposed methods for discrete mixture models such as M3 and M4. For each model, the true values of α and β are calculated and listed in Table 2. For each of the mixture models in Table 2, we consider different λ values varying from .05 to .95. Under each model, we take 1000 repeated random samples. For the kernel density estimators f_m and h_n in (16) and (17) respectively, the standard normal p.d.f. is used for both kernels K₀ and K₁, and the bandwidths b_m and b_n are chosen in the same way as in Silverman (1986) and Wu and Abedin (2021). The effect of the choice of kernel on the kernel estimator is trivial, while the bandwidth has more influence on the finite-sample performance and the convergence rate of the nonparametric kernel estimator. Nevertheless, Theorems 4-6 indicate that the MHDE has consistency and asymptotic normality if the bandwidths are of the form b_m, b_n = O(N^{−r}) with r from a subinterval of (0, 1). This subinterval may depend on the populations of the two components. With the component populations unknown, we choose r = 1/5 in our simulation, as in Silverman (1986) and Wu and Abedin (2021), which achieves the optimal mean squared error of the kernel estimator. In addition, the bandwidths b_m and b_n of our choice are adaptive in the sense that they involve a factor of a robust scale estimate to incorporate the different variations of f and h. To compute the estimates numerically, we use λ̂₊, an estimator based on the odds ratio given by Smith and Vounatsou (1997), as the initial value of λ. Initial values of α and β are calculated by exploiting the relationship (3), or equivalently log{g(y)/f(y)} = α + βy: for each T_i in the pooled sample, we generate the pair (T_i, log{ĝ(T_i)/f̂(T_i)}), with f̂ and ĝ kernel density estimates of the two components, and then fit a simple linear regression to these pairs. The simulation results are reported in Tables 3, 4 and 5. Table 3 gives the biases and MSEs of λ̂_MELE and λ̂_MHDE for all five models with varying λ values and sample size m = n = 30, whereas Table 4 is for m = n = 100.
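The initial-value construction described above can be sketched as follows. Purely for illustration, the second kernel estimate below is computed from a sample drawn directly from g rather than recovered from the mixture, and the Silverman-type bandwidth constant 1.06 is an assumed choice; all names are hypothetical.

```python
import numpy as np

def kde_at(points, data, bw):
    """Gaussian kernel density estimate evaluated at given points."""
    u = (points[:, None] - data[None, :]) / bw
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * bw * np.sqrt(2 * np.pi))

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, 200)              # from f = N(0,1)
y = rng.normal(1.0, 1.0, 200)              # stand-in draw from g = N(1,1)
t = np.concatenate([x, y])                 # pooled sample

bw = 1.06 * t.std() * len(t) ** (-0.2)     # bandwidth of order N^(-1/5)
f_hat = kde_at(t, x, bw)
g_hat = kde_at(t, y, bw)

# least-squares fit of log(g/f) = alpha + beta*t at the pooled points
A = np.column_stack([np.ones_like(t), t])
(alpha0, beta0), *_ = np.linalg.lstsq(A, np.log(g_hat / f_hat), rcond=None)
```

For this pair of components the true slope is β = 1, so the fitted slope should come out positive.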
For the purpose of comparison, we also report the results of the two estimators λ̂ and λ̂_L proposed in Wu and Abedin (2021) for model (1) directly, without assuming the relationship (3). The estimator λ̂ is based on cumulative distribution function estimation, while λ̂_L is the MLE of the multinomial approximation of model (1). From Tables 3 and 4 we can see that all four methods are very competitive as far as λ is concerned. The nonparametric estimators λ̂ and λ̂_L perform slightly better than the semiparametric estimators λ̂_MELE and λ̂_MHDE when the two components are close to each other (such as M1 and M3) or the relationship (3) does not hold (M5), while λ̂_MELE and λ̂_MHDE perform better than λ̂ and λ̂_L when the two components are relatively far apart (such as M2). Both the MELE and the MHDE perform surprisingly well for model M5, even though the assumed relationship (3) is not valid for M5. The simulation results for estimating β are given in Table 5. Since (3) is not assumed in Wu and Abedin (2021), their methods are for estimating λ only, and thus in Table 5 we compare β̂_MELE and β̂_MHDE only. In addition, we omit the results for α and for model M5, as α depends on β through normalization and M5 does not satisfy (3). From Table 5 we observe that the MHDE performs better than the MELE in terms of both smaller bias and smaller MSE. However, the MHDE tends to give very large bias and MSE when λ is small, such as in M2 with λ = .05. We also observe from Table 5 that the MSEs of both the MHDE and the MELE of β decrease as the true λ value increases, which is very consistent with the observations from Table 1 on their asymptotic variances. This phenomenon is justified by the fact that when λ increases, the expected number of observations from the second component g gets larger, and as a result the data contain more and more information about β. This is also the reason why the MHDE and the MELE give large biases and inflated MSEs for small λ, especially when the sample size is small.
For example, for model M2 with λ = 0.05 and m = n = 30 (or 100), we would expect only nλ = 1.5 (or 5) observations on average from g that contain information about β, which thus produces estimates of β with large bias and MSE. In order to examine the effect of unbalanced sample sizes on the performance of the MELE and MHDE, in Table 6 we present the simulation results for sample sizes (m, n) = (50, 150) and (150, 50). Comparing Table 6 with Table 4, we observe that as far as λ is concerned, the MELE and MHDE under the unbalanced sample sizes (m, n) = (50, 150) and (150, 50) perform slightly worse than, though still comparably to, the balanced case (m, n) = (100, 100), especially in terms of MSE. When β is concerned and Table 6 is compared with Table 5, we observe that the performance of the MELE and MHDE under unbalanced sample sizes is significantly worse than that under balanced sample sizes, especially in terms of MSE. These phenomena can possibly be explained by the fact that both asymptotic covariance matrices of the MELE and MHDE, given in Theorems 2 and 6 respectively, have a factor 1/[ρ(1 − ρ)] in front. Other places in the covariance matrix expressions also involve ρ, though not as influential as this factor, which makes the level of deterioration different for λ and β when m/(m + n) deviates from 1/2 (i.e., balanced samples). To examine the performance of p̂_MELE and p̂_MHDE given in (8) and (18) respectively, we calculate misclassification rates (MR) based on these estimates and a simple classification rule with .5 as a hard threshold, i.e., an individual with observation y is classified as from G if p̂(y) > .5 and as from F otherwise, where p̂ is either p̂_MELE or p̂_MHDE. Considering the fact that the MR is higher for some models, such as M1 where the two components significantly overlap, than for some other models such as M2, we use the optimal misclassification rate (OMR) in Wu and Abedin (2021) as the baseline to compare with.
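The hard-threshold classification rule can be sketched as below. The parameter values correspond to an M1-type setting (1 − λ)N(0, 1) + λN(1, 1) with λ = .5 and are plugged in directly in place of actual estimates; the labels are, of course, available only in simulation.

```python
import numpy as np

def classify(y, lam, a, b):
    """Assign y to G when the posterior p-hat(y) of (8)/(18) exceeds .5."""
    p = lam * np.exp(a + b * y) / (1.0 - lam + lam * np.exp(a + b * y))
    return p > 0.5

rng = np.random.default_rng(9)
labels = rng.random(1000) < 0.5                       # True = drawn from G
obs = np.where(labels, rng.normal(1.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000))
pred = classify(obs, 0.5, -0.5, 1.0)                  # true (lam, alpha, beta)
mr = np.mean(pred != labels)                          # misclassification rate
```

With the true parameters the rule reduces to "classify as G when y > .5", so mr approximates the OMR, here Φ(−.5) ≈ .31.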
The OMR is the misclassification rate calculated assuming p(y) is completely known (the ideal scenario), with the same value .5 used as the classification threshold. The simulation results are presented in Table 7. For the purpose of comparison, we also report the results of this classification rule but based on the nonparametric estimators λ̂ and λ̂_L in Wu and Abedin (2021). From Table 7 we observe that the rule based on λ̂_L performs worst, while the other three methods are quite competitive. Particularly, the rules based on the MELE and MHDE perform much better than that based on λ̂ in model M2, while λ̂ has slightly better performance in M1, M3 and M5. Even though the assumed relationship (3) is not valid for M5, the classification results based on the MELE and MHDE remain competitive for this model.

In this subsection, we examine the robustness properties of the proposed MELE and MHDE. In order to compare with the two nonparametric estimators in Wu and Abedin (2021), we check the performance of the estimators of λ only, but not of α and β, which Wu and Abedin (2021) did not consider. Specifically, we examine the behavior of the four estimators λ̂_MELE, λ̂_MHDE, λ̂ and λ̂_L when the data are contaminated by a single outlying observation. The case of several outliers is similar and thus omitted here. Note that the outlying observation can be in either the first sample from f(x) or the second sample from the mixture h(x). We report results for the latter case; similar results are observed for the former case. We look at the change in the estimate before and after data contamination. For this purpose, the α-influence function (α-IF) given in Beran (1977) is an appropriate measure of the change in an estimate. However, its application in the mixture context is very difficult, as discussed in Karlis and Xekalaki (1998). Therefore, we employ an adaptive version of the α-IF as in Lu et al. (2003), which uses the change in the estimate, before and after outlying observations are included, divided by the contamination rate.
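The adaptive α-IF described above can be sketched as follows. The stand-in estimator of λ is a crude method-of-moments device chosen only so the sketch is self-contained (under M1, E(Y) − E(X) = λ); it is not the MELE, MHDE, λ̂ or λ̂_L.

```python
import numpy as np

def alpha_if(estimator, x, y, outlier, n_outliers=1):
    """Adaptive alpha-influence: change in the estimate after replacing the
    last n_outliers observations of y by the outlier, divided by the
    contamination rate n_outliers / N."""
    base = estimator(x, y)
    y_cont = np.concatenate([y[:-n_outliers], np.full(n_outliers, outlier)])
    rate = n_outliers / (len(x) + len(y))
    return (estimator(x, y_cont) - base) / rate

# hypothetical moment-based stand-in for an estimator of lam under M1
est = lambda x, y: np.clip(np.mean(y) - np.mean(x), 0.0, 1.0)

rng = np.random.default_rng(13)
x = rng.normal(0.0, 1.0, 30)
y = np.concatenate([rng.normal(0.0, 1.0, 15), rng.normal(1.0, 1.0, 15)])
influence = alpha_if(est, x, y, outlier=-30.0)
```

A bounded α-IF as the outlier grows is the robustness property the MHDE is expected to exhibit; a moment-based estimator like the stand-in above is not robust.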
Taking model M1 as an example, after drawing two independent samples, one from N(0, 1) and the other from (1 − λ)N(0, 1) + λN(1, 1), we replace the last observation generated from the mixture with a single outlier, an integer from the interval [−30, 20]. Then the α-IF is computed as the change in the estimate divided by the contamination rate, averaged over 100 replications, where the total sample size is N = 60 or 200 and W, the estimator of λ under study, is either λ̂_MELE, λ̂_MHDE, λ̂ or λ̂_L. The results show that, when the four estimators are compared, λ̂_L behaves the worst in terms of having the largest α-IF for the mixtures of normals, and λ̂ behaves the worst for the mixtures of Poisson distributions.

In this part, we examine the performance of the K-S tests proposed in Sect. 5 for testing the validity of model (3). We consider model (3) with r(x) = (x, x²)⊤ as the collection of all possible models under consideration. Then we test whether the reduced model (3) with r(x) = x is the actual true model or not. For demonstration purposes, we consider the mixture of normals with F = N(0, 1) and G ∼ N(μ, σ²). Then f(x) and h(x) are related by

h(x) = [1 − λ + λ exp(α + βx + γx²)] f(x),  (24)

where α = −(1/2)[log σ² + μ²/σ²], β = μ/σ² and γ = (1/2)(1 − 1/σ²). Note that (24) corresponds to a special case of (3) with r(x) = (x, x²)⊤. If σ = 1, then γ = 0 and thus model (3) holds with r(x) = x. So testing the validity of model (3) with r(x) = x is equivalent to testing the null hypothesis H₀: γ = 0 under model (24). We consider γ = 0, −.9 and −1.5, λ = .35 and .65, and sample sizes m = n = 30 and m = n = 100. For simplicity, we just fix μ = 1, and as a result σ = 1, .6 and .5 for γ = 0, −.9 and −1.5, respectively. For each γ, λ and sample size considered, we use 500 replications in our calculation. Within each replication, we use 1000 bootstrap samples to approximate the critical values. The results are reported in Table 8. Note that γ = 0 means model (3) with r(x) = x is correct, and thus the corresponding calculated values in Table 8 are the estimated significance levels.
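The reparametrization in (24) can be checked numerically; the snippet below computes (α, β, γ) for given (μ, σ) and verifies the log density ratio identity at a few illustrative points.

```python
import math

def tilt_params(mu, sigma):
    """(alpha, beta, gamma) in log{g(x)/f(x)} = alpha + beta*x + gamma*x**2
    for f = N(0,1) and g = N(mu, sigma^2), as in (24)."""
    alpha = -0.5 * (math.log(sigma**2) + mu**2 / sigma**2)
    beta = mu / sigma**2
    gamma = 0.5 * (1.0 - 1.0 / sigma**2)
    return alpha, beta, gamma

def norm_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 1.0, 0.6
a, b, g = tilt_params(mu, sigma)
for x in (-1.0, 0.3, 2.0):
    lhs = math.log(norm_pdf(x, mu, sigma) / norm_pdf(x, 0.0, 1.0))
    assert abs(lhs - (a + b * x + g * x**2)) < 1e-12
# sigma = 0.6 gives gamma ~ -0.9; sigma = 1 gives gamma = 0 (H0 holds)
```

This reproduces the (γ, σ) pairings used in the simulation design: σ = 1, .6, .5 give γ = 0, ≈ −.9, −1.5.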
When γ ≠ 0, model (3) with r(x) = x is not correct, and thus the correspondingly calculated values in Table 8 are the estimated powers at that value of γ. From Table 8 we can see that the two test statistics KS_MELE and KS_MHDE are quite competitive in terms of achieved significance level and power. The achieved levels of significance are quite close to the nominal levels for most cases, except for KS_MHDE with λ = 0.65 and m = n = 30. The power of KS_MHDE becomes larger as γ moves away from 0, except for the case of λ = 0.65 and m = n = 30. Surprisingly, the power of KS_MELE becomes smaller as γ moves away from 0, except for the case of λ = 0.65 and m = n = 100. Note that KS_MELE and KS_MHDE use, respectively, the MELE and MHDE of the model parameters, which were shown in Sect. 6.1 to have large bias and MSE when sample sizes are small. Consequently, the results for m = n = 30 show some anomalies, while the results for m = n = 100 are much better and more reliable. As expected, when the significance level a decreases, both the observed significance level and the power decrease. For both KS_MELE and KS_MHDE, the powers are generally high for significance levels a = 0.10 and 0.05.

In this section we consider two real-life data examples to demonstrate the applications of our proposed methods.

Example 1: Grain data. Smith and Vounatsou (1997) analyzed a dataset to determine the proportion of cells in a test sample of size n = 94 that were exposed to radioactive materials. The cells in the control group, of size m = 94, were not exposed to radioactivity. The number of grains shown in the autoradiograph of a cell can measure the amount of radioactive material; however, grains can appear in the autoradiograph either due to radioactive material or due to background fogging. This dataset and further details are available in Smith and Vounatsou (1997), and Wu and Abedin (2021) showed that the dominance constraint F ≥ G is valid for this dataset.
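The bootstrap calibration used for the K-S tests can be sketched generically. The skeleton below is not the paper's KS_MELE or KS_MHDE: it uses a deliberately simple stand-in null model (N(θ, 1) with θ fitted by the sample mean) only to show the refit-on-each-bootstrap-sample logic, where the null distribution of the K-S distance is built by resampling from the fitted null model.

```python
import numpy as np
from scipy.stats import norm

def ks_distance(sample, cdf):
    """Kolmogorov-Smirnov distance between the empirical cdf of sample and cdf."""
    x = np.sort(sample)
    n = len(x)
    F = cdf(x)
    return max(np.max(np.arange(1, n + 1) / n - F), np.max(F - np.arange(n) / n))

def bootstrap_ks_pvalue(sample, fit, sampler, cdf, B=200, rng=None):
    """Parametric-bootstrap p-value: refit the null model on each bootstrap
    sample and compare the observed KS distance with the null distribution."""
    rng = rng or np.random.default_rng(0)
    theta = fit(sample)
    t_obs = ks_distance(sample, lambda x: cdf(x, theta))
    t_boot = []
    for _ in range(B):
        bs = sampler(theta, len(sample), rng)
        th_b = fit(bs)                       # re-estimate under the null
        t_boot.append(ks_distance(bs, lambda x: cdf(x, th_b)))
    return np.mean(np.array(t_boot) >= t_obs)

# toy stand-in null model: N(theta, 1) with theta estimated by the sample mean
fit = np.mean
cdf = lambda x, th: norm.cdf(x, loc=th)
sampler = lambda th, n, rng: rng.normal(th, 1, n)

rng = np.random.default_rng(7)
p_null = bootstrap_ks_pvalue(rng.normal(0.3, 1, 100), fit, sampler, cdf, rng=rng)
p_alt = bootstrap_ks_pvalue(rng.exponential(1, 100), fit, sampler, cdf, rng=rng)
```

Data actually generated from the null model yield a moderate p-value, while clearly misspecified data (here exponential) are rejected, mirroring the level/power pattern reported in Table 8.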
We apply the two proposed estimators to this data and compare them with the estimators in Wu and Abedin (2021), Smith and Vounatsou (1997) and Smith et al. (1986). The results are given in Table 9, in which the bootstrap method with 1000 repetitions is used to calculate the confidence intervals of λ. From Table 9 we can see that our two proposed methods produce point estimates of λ similar to most other methods available in the literature, except for the two-by-two table and logistic power methods. For reference, the competing rows of Table 9 read:

Poisson mixture (Smith et al., 1986): .77 (.00–.91)
Two-by-two table (Smith and Vounatsou, 1997): .20 (.00–1.00)
Logistic power (Smith and Vounatsou, 1997): .61 (.58–.64)
Monotone logistic (Smith and Vounatsou, 1997): .74 (.61–1.00)
Latent class (Smith and Vounatsou, 1997): .73 (.63–.83)

When interval estimation is concerned, the two proposed estimators produce confidence intervals with widths similar to those of λ̂, λ̂_L and the latent class method, but much smaller than those of the Poisson mixture, two-by-two table and monotone logistic methods. In Table 9, the confidence intervals in parentheses for λ̂_MELE and λ̂_MHDE are calculated using the asymptotic covariance matrices given in Theorems 2 and 6, respectively. From the results we see that the bootstrap approximation is quite accurate. So when making inferences about λ, the computationally intensive bootstrap is not necessary for λ̂_MELE and λ̂_MHDE, while it is necessary for all other methods due to the unavailability of their asymptotic distributions. This is another great benefit of using λ̂_MELE and λ̂_MHDE over other methods.

Example 2: Malaria data. We also examine a clinical malaria dataset first described by Kitua et al. (1996) and further studied by Vounatsou et al. (1998).
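The bootstrap intervals in Table 9 follow the standard percentile recipe. A minimal sketch, with a toy proportion-like estimator standing in for λ̂_MELE or λ̂_MHDE (the data and values here are illustrative, not the grain data):

```python
import numpy as np

def bootstrap_ci(sample, estimator, B=1000, level=0.95, rng=None):
    """Percentile bootstrap confidence interval for a point estimator."""
    rng = rng or np.random.default_rng(0)
    n = len(sample)
    stats = np.array([estimator(sample[rng.integers(0, n, n)])  # resample with replacement
                      for _ in range(B)])
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

# toy example: interval for a proportion-like mean of 0/1 indicators
rng = np.random.default_rng(42)
data = (rng.random(200) < 0.6).astype(float)
lo, hi = bootstrap_ci(data, np.mean, rng=rng)
```

The resulting interval brackets the point estimate, which is the behaviour exploited when comparing bootstrap intervals with the asymptotic-covariance intervals of Theorems 2 and 6.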
Parasite densities in children with fever can be modelled as a two-component mixture, where one component represents the parasite densities in children without clinical malaria (F), the other those with clinical malaria (G), and λ is the proportion of children whose fever is attributable to malaria. In this dataset, levels of parasitaemia were observed in children aged between 6 and 9 months in a village in Tanzania for both the wet season (m = 94, n = 251) and the dry season (m = 122, n = 245). Wu and Abedin (2021) showed that the dominance constraint F ≥ G is valid for both seasons. We apply the proposed λ̂_MELE to both seasons and compare the results with λ̂_L in Wu and Abedin (2021) and the Bayesian approach in Vounatsou et al. (1998). Note that this is discretized data (parasitaemia levels were grouped into 10 categories), so kernel smoothing, and hence λ̂_MHDE and λ̂, are not applicable. Table 10 presents the estimation results, where the numbers in parentheses are estimated standard errors based on 500 bootstrap samples. From this table we observe that λ̂_MELE, λ̂_L and the Bayesian approach produce similar estimates.

In this work, we proposed a two-component semiparametric mixture model (4) to accommodate the stochastic dominance constraint. We not only proposed and studied two estimators, the MELE and the MHDE, but also developed K-S statistics to test the validity of model (4). For both estimators, we proved their consistency and asymptotic normality. The finite-sample performance, in terms of both efficiency and robustness, of the proposed estimators and tests was examined through simulation studies and real data analysis. The introduction of the two-component semiparametric mixture model (4) has several advantages over the fully nonparametric mixture model (1). First, the semiparametric model automatically resolves the unidentifiability issue that the fully nonparametric model generally has.
Second, the semiparametric model easily accommodates the stochastic dominance constraint through an equivalent and natural positivity constraint on the parameter β, while the dominance constraint in the original nonparametric model is imposed on two functions, which is much harder to handle and assess directly. Third, the semiparametric model has much better interpretability than the nonparametric model: the value of β quantifies the difference between the two components, and the larger the value the bigger the difference. Fourth, for the semiparametric model we proposed two estimators with proven asymptotic normality and derived asymptotic covariance matrices; such asymptotic results are lacking in the literature for the nonparametric model. As a result, we can easily conduct statistical inferences for the mixing proportion λ (and β) based on these derived asymptotics, while for the nonparametric model some computationally intensive method, such as the bootstrap, is necessary in order to make inferences about λ, due to the unavailability of asymptotic properties of the available estimators. Of course, using the semiparametric model instead of the fully nonparametric model loses some flexibility, but we believe this loss is not significant given that many commonly used population families satisfy the exponential tilt relationship (3). This loss is worthwhile, considering the aforementioned advantages of the semiparametric model.

When the two proposed estimators for the semiparametric mixture model (4) are compared, we observe the following preferences from our numerical studies. When the estimation of the mixing proportion λ or classification is our interest, the MELE and MHDE perform equally well, in terms of bias and MSE or misclassification rate, and neither dominates the other. When the estimation accuracy of (α, β) is our focus, the MHDE is highly preferred over the MELE when λ is moderate or large, while the MELE is preferred when λ is small.
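The equivalence between the dominance constraint and the sign of β can be seen numerically: tilting f by exp(α + βx) with β > 0 shifts mass to the right, so F(x) ≥ G(x) pointwise. A small check with f = N(0, 1), β = 1, and α chosen so that the tilt integrates to one (illustrative values, not from the paper):

```python
import numpy as np
from scipy.stats import norm

# Tilting f = N(0,1) by exp(alpha + beta*x) with beta > 0 shifts mass rightward.
# With alpha = -beta**2/2 the tilt is exactly the N(beta, 1) density, so the
# dominance constraint F(x) >= G(x) should hold at every x.
beta = 1.0
alpha = -beta**2 / 2

grid = np.linspace(-10, 10, 20001)            # fine grid for a numerical cdf
dx = grid[1] - grid[0]
g_vals = np.exp(alpha + beta * grid) * norm.pdf(grid)
cdf_g = np.cumsum(g_vals) * dx                # Riemann-sum cdf of the tilted density
total_mass = cdf_g[-1]                        # ~1: the tilt is a proper density

x = np.linspace(-5, 5, 201)
F = norm.cdf(x)
G = np.interp(x, grid, cdf_g)
dominance_holds = bool(np.all(F >= G - 1e-9))  # F >= G everywhere (up to rounding)
```

Repeating the check with β < 0 reverses the inequality, which is why the dominance constraint reduces to a simple sign constraint on β under model (3).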
When the presence of outliers is our concern, the MHDE is much more robust and thus is highly preferred over the MELE. Since the proposed model and methods were motivated by the example of case-control genetic studies based on test statistic values (one dimension), this paper focused only on the univariate case. Nevertheless, the proposed MELE and MHDE methods can be easily extended to the multidimensional case. On the other hand, the kernel estimation used in the MHDE method may not handle high dimensionality well. For future work, we may consider using minimum profile Hellinger distance estimation (MPHDE) for the semiparametric mixture model (4). Wu and Karunamuni (2015) first introduced the profile Hellinger distance for semiparametric models and investigated the MPHDE for semiparametric models of general form; they proved that the MPHDE is as robust as the MHDE and achieves full efficiency at the true model. Xiang et al. (2014) used the MPHDE for a two-component semiparametric mixture model where one component is known up to some unknown parameters while the other component is unspecified. Wu et al. (2017) and Wu and Zhou (2018) applied the MPHDE to semiparametric location-shifted mixture models. Another direction for approaching the estimation problem of model (4) could be the Bayesian method; Vounatsou et al. (1998) applied a Gibbs sampling approach to estimate the unknown mixing proportion in a two-component mixture model with discretized data.
Supplementary Information: The online version contains supplementary material available at 10.1007/s10463-022-00835-5.

References

Inferences for two-component mixture models with stochastic dominance.
Separate sample logistic discrimination.
Multivariate logistic compounds.
Minimum Hellinger distance estimators for parametric models.
Semiparametric estimation of a two-component mixture model where one component is known.
Statistical Methods in Cancer Research: the Analysis of Case-Control Studies.
Molecular classification of acute leukemia.
An improved goodness-of-fit test for logistic regression models based on case-control data by random partition.
Using specially designed exponential families for density estimation.
Discovering condition-specific gene co-expression patterns using Gaussian mixture models: a cancer case study.
Minimum Hellinger distance estimation for Poisson mixtures.
One-step minimum Hellinger distance estimation.
Bayesian approach to single-cell differential expression analysis.
Plasmodium falciparum malaria in the first year of life in an area of intense and perennial transmission.
Efficiency versus robustness: the case for minimum Hellinger distance and related methods.
Analyzing allele specific RNA expression using mixture models.
Minimum Hellinger distance estimation for finite mixtures of Poisson regression models and its applications.
Logistic disease incidence models and case-control studies.
Empirical likelihood ratio based confidence intervals for mixture proportions.
Empirical likelihood and general estimating equations.
A goodness of fit test for logistic regression models based on case-control data.
Density Estimation for Statistics and Data Analysis.
Logistic regression and latent class models for estimating positives in diagnostic assays with poor resolution.
Selection of a mouse embryonal carcinoma clone resistant to the inhibition of metabolic cooperation by retinoic acid.
Bayesian analysis of two-component mixture distributions applied to estimating malaria attributable fractions.
Robust estimation of mixture complexity.
Robust estimation of mixture complexity for count data.
A two-component nonparametric mixture model with stochastic dominance.
Profile Hellinger distance estimation.
Minimum Hellinger distance estimation for a semiparametric location-shifted mixture model.
Minimum Hellinger distance estimation in a two-sample semiparametric model.
Computation of an efficient and robust estimator in a semiparametric mixture model.
Minimum profile Hellinger distance estimation for a semiparametric mixture model.
A chi-squared goodness-of-fit test for logistic regression models based on case-control data.
An information matrix test for logistic regression models based on case-control data.
An EM algorithm for a semiparametric finite mixture model.
A score test under a semiparametric finite mixture model.

Acknowledgements The authors thank the Chief Editor, the Associate Editor and the referees for their valuable comments and suggestions that have led to significant improvements in the manuscript. The authors acknowledge with gratitude the support for this research via Discovery Grants project RGPIN-2018-04328 from the Natural Sciences and Engineering Research Council (NSERC) of Canada and NSF project ZR2021MA048 of Shandong Province of China.

Appendix

…(C11) hold with b_m, b_n ≍ N^{−r}, 1/4 < r < 1/2, and any sequence δ_N such that δ_N = o(N^r log N) and log N = o(δ_N).

We decompose the parameter vector θ into two parts θ = (λ, θ_r^⊤)^⊤, where θ_r = (α, β)^⊤ represents the regression coefficient parameters in (4). Note that g_{θ_r}(x) = e^{α+βx} f(x) is essentially the g in (3).
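Under this decomposition, the log density ratio log(g_{θ_r}(x)/f(x)) = α + βx is exactly linear in x, so θ_r can be read off a degree-one fit of the log ratio. A quick numerical check with illustrative parameter values (λ, α, β) = (0.35, −0.5, 1):

```python
import numpy as np
from scipy.stats import norm

# theta = (lam, theta_r^T)^T with theta_r = (alpha, beta);
# g_{theta_r}(x) = exp(alpha + beta*x) f(x), so log(g/f) is linear in x
lam, alpha, beta = 0.35, -0.5, 1.0
x = np.linspace(-4, 4, 81)
f = norm.pdf(x)
g = np.exp(alpha + beta * x) * f
h = (1 - lam) * f + lam * g           # mixture density values, as in model (4)
log_ratio = np.log(g / f)

# recover theta_r = (alpha, beta) from the log density ratio by a linear fit
b_hat, a_hat = np.polyfit(x, log_ratio, 1)
```

The fit recovers (α, β) exactly, which is the structural fact the semiparametric estimators exploit.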
In parallel, for each t ∈ Θ we write t = (t₁, t_r^⊤)^⊤ with t_r = (t₂, t₃)^⊤ and g_{t_r}(x) = e^{t₂+t₃x} f(x).

(D1) There exists an ε-neighbourhood B(θ_r, ε) of θ_r for some ε ≥ 0 such that g_{t_r} − g_{θ_r} is bounded by an integrable function for any t_r ∈ B(θ_r, ε).
(D2) f and K₀ in (4) and (16)–(19) …
(C5) The second derivative of f exists and satisfies, for i = 1, 2, 3, as N → ∞, …

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.