Evidence for measurement bias of the short form health survey based on sex and metropolitan influence zone in a secondary care population


RESEARCH Open Access

Evidence for measurement bias of the
short form health survey based on sex and
metropolitan influence zone in a secondary
care population
Jake Ursenbach1, Megan E. O’Connell1*, Andrew Kirk2 and Debra Morgan3

Abstract

Background and objectives: The 12-item Short Form Health Survey (SF-12) is a widely used measure of health
related quality of life, but has been criticized for lacking an empirically supported model and producing biased
estimates of mental and physical health status for some groups. We explored a model of measurement with the
SF-12 and explored evidence for measurement invariance of the SF-12.

Research design and methods: The SF-12 was completed by 429 caregivers who accompanied patients with
cognitive concerns to a memory clinic designed to service rural/remote-dwelling individuals. A multi-group
confirmatory factor analysis was used to compare the theoretical measurement model to two empirically identified
factor models reported previously in general population studies.

Results: A model that allowed mental and physical health to correlate, and some items to cross-load provided the
best fit to the data. Using that model, measurement invariance was then assessed across sex and metropolitan
influence zone (MIZ; a standardized measure of degree of rurality).

Discussion: Partial scalar invariance was demonstrated in both analyses. Differences by sex in latent item intercepts
were found for items assessing feelings of energy and depression. Differences by MIZ in latent item intercepts were
found for an item concerning how current health limits activities.

Implications: The fitting model was one where the mental and physical health subscales were correlated, which is
not provided in the scoring program offered by the publishers. Participants’ sex and MIZ should be accounted for
when comparing their factor scores on the SF-12. Additionally, consideration of geographic residence and
associated cultural influences is recommended in future development and use of psychological measures with such
populations.

Keywords: Equivalency, Test bias, Psychometric theory, Quality of life

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if
changes were made. The images or other third party material in this article are included in the article's Creative Commons
licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons
licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the
data made available in this article, unless otherwise stated in a credit line to the data.

* Correspondence: megan.oconnell@usask.ca
1Department of Psychology, University of Saskatchewan, 9 Campus Drive,
Saskatoon S7N 5A5, Saskatchewan, Canada
Full list of author information is available at the end of the article

Ursenbach et al. Health and Quality of Life Outcomes           (2020) 18:91 
https://doi.org/10.1186/s12955-020-01318-y

http://crossmark.crossref.org/dialog/?doi=10.1186/s12955-020-01318-y&domain=pdf
http://creativecommons.org/licenses/by/4.0/
http://creativecommons.org/publicdomain/zero/1.0/
mailto:megan.oconnell@usask.ca


Background and objectives
As the proportion of older adults in Canada grows, it is
crucial that that healthcare services in Canada adapt to
meet their needs [1]. One important way healthcare and
other policy decisions are made involves assessing indi-
viduals’ health-related quality of life, which is a construct
that summarizes their physical, social, and emotional
status as it relates to their prior and current health state
[2]. One commonly used measure of health related qual-
ity of life is the Medical Outcomes Study 12-item Short
Form Health Survey (SF-12) [3].
The items of the SF-12 were derived from the 36-Item

Short Form Health Survey (SF-36), a longer health survey
which has been used in more than 5000 studies inter-
nationally [4] that has consistently demonstrated utility in
distinguishing between known groups based on physical or
mental health status [3]. The SF-36 has eight subscales cov-
ering a range of physical and mental health concerns, such
as ‘role limitations due to physical health’ and ‘emotional
well-being.’ The subscale scores are combined to produce a
physical health summary component (PCS) and a mental
health summary component (MCS), which are based on a
principal component analysis of the eight subscale scores
using an orthogonal rotation, based on an assumption that
physical and mental health are not correlated [3]. The
stated goal of the SF-12 was to reproduce the PCS and
MCS scores in a survey that could be completed in under
two minutes. To this end, the SF-12 items were selected
using stepwise regressions of the MCS and PCS on the SF-
36 items in a large population study conducted in the
United States, which produced regression weights for 12 of
the SF-36 items that best approximated the subscale scores
of the test. This model approximation approach to test
length reduction is effective in some situations, but may
not generalize well to different populations [5].
Despite its widespread use, two prominent concerns

about the validity of the SF-12 have been raised. First, some
researchers argue that the measurement model of the scale,
although theoretically described by Ware et al. [3], was not
actually tested in the creation of the scale and is not empir-
ically supported [6]. This criticism relates to the way that
the summary scores for the mental and physical health sub-
scales were derived, specifically, to the extent that the meas-
urement model was derived from the SF-36, and the
assumption that the latent constructs of physical and men-
tal health are uncorrelated [7, 8]. Instruments such as the
SF-36 and the SF-12 consist of a series of Likert-scale ques-
tions which act as indicators, or manifest variables, influ-
enced by unobserved, or latent variables, which for the SF-
12 are the mental and physical health related quality of life.
Several studies have employed an exploratory factor ana-
lysis approach using a principal component analysis with
orthogonal rotation, consistent with the development of the
SF-36 and the assumption of uncorrelated mental and

physical health. While some of these studies supported the
hypothesized 2 component structure [9–11], others pro-
duced a three-component solution, with a general health
component in addition to the mental and physical health
components [12, 13]. While the evidence of a two-factor
solution is supportive of the theoretical model, significant
methodological issues limit the strength of this evidence,
specifically, the estimated correlations between the individ-
ual items and the latent physical and mental health vari-
ables are likely inaccurate due to the use of principal
component analysis, which assumes an initial communality
of one, and the specified orthogonal relationship between
the mental and physical health components, which is not
supported empirically. The assumption of an orthogonal
relationship has resulted in biased scoring coefficients
where poorer physical health results in an overestimate of
mental health and vice versa [6, 14–16].
Other studies have used alternative approaches to de-

termine the appropriate measurement model of the SF-
12. The models identified by two such studies are sum-
marized in Table 1. In the first study, Fleishman and
Lawrence [17] used a confirmatory factor analysis (CFA)
to explore the factor structure by beginning with the
simple structure outlined by Ware et al. [3], and then in-
crementally improving it by allowing the factors to cor-
relate and using the modification indices. They achieved
an adequate fit by allowing the residuals on several simi-
larly worded items to correlate, as well as allowing cer-
tain items to cross-load on both factors. In the second
study, Tucker et al. [7] followed a similar approach but
did not permit items to cross load. In summary, al-
though the SF-12 generally appears to have a two-factor
solution, the measurement model specified by Ware and
colleagues does not seem well supported empirically.
Various alternative models have been identified in gen-
eral population studies, and it is not clear which is ap-
propriate for the subpopulation of the present study.
The second concern is that the physical and mental

health related quality of life estimates of the scale are
not invariant, or in other words, are biased against some
populations, and therefore group comparisons are not ap-
propriate unless measurement invariance is established

Table 1 Sample Characteristics

Characteristic

Sample size (n) 429

Age, M (SD) 70.9 (0.49)

Sex, Female (%) 238 (58.8%)

MIZ, no MIZ to weak MIZ vs moderate to urban (%) 191 (47.2%)

SF-12, M (SD) 44.49 (7.43)

PCS, M (SD) 20.23 (4.67)

MCS, M (SD) 24.20 (3.86)

Ursenbach et al. Health and Quality of Life Outcomes           (2020) 18:91 Page 2 of 10


first. That is, does the SF-12 measures these latent vari-
ables in the same manner for all persons who complete
the scale, or does it measure these latent variables differ-
ently for subpopulations [18]. When an instrument
demonstrates measurement invariance, knowledge of
population membership will provide no new information
about an individual’s scores on the observed variables
given knowledge of their level of the latent variable [18].
Measurement invariance is difficult to assess because it is
impossible to directly know an individual’s level of the la-
tent variable. For example, men and women could demon-
strate different mean scores on the MCS scale, which
could represent true differences in mental health status,
differences in measurement model but equality in mental
health status, or a combination of the two.
One approach to assessing measurement invariance in-

volves assessing items individually for differential item func-
tioning. Some studies have found evidence of significant
differential item functioning across groups on the SF-12.
For example, in a nationally representative sample study
conducted in the United States, Flieshman and Lawrence
[17] found evidence that men were less likely to endorse
items suggesting they had trouble climbing stairs, they felt
downhearted, or they lacked energy when compared to
women of a similar mental and physical health status. Simi-
lar problems with differential item functioning have been
reported when comparing White to Black and Hispanic
American respondents, and comparing younger to older re-
spondents [17, 19]. In contrast, some studies have investi-
gated differential item functioning across age and sex and
have not found evidence of it [20], while others have found
evidence of it when comparing patients with stroke to nor-
mal controls, but argued that the evidence present failed to
reach the level of practical significance [21].
A second approach to assessing measurement invari-

ance involves using multigroup confirmatory factor ana-
lysis (MG-CFA) to compare the change in model fit
indices across a series of models which progressively
constrain the structural model to be equal across groups
[18]. As the models become incrementally more restrict-
ive, a hierarchy of invariance has been established [22].
The most commonly assessed forms of invariance are:
Configural, where groups have the same number of la-
tent variables, and items load on latent variables in a
similar pattern; Metric (weak factorial), where item load-
ings do not significantly differ across groups; Scalar
(strong factorial), where item intercepts do not signifi-
cantly differ across groups; and Strict factorial, where
item residual variances do not differ across groups [22].
In order to meaningfully compare group means on the
latent variable, scalar invariance is recommended [18].
A few studies of the SF-12 have been conducted using

this methodology, though predominantly in clinical or
marginalized groups. For example, Okonkwo et al. [21]

found evidence of metric invariance between patients
who experienced stroke and healthy controls. Similarly,
another study found evidence of strict invariance be-
tween four groups of Canadians with different levels of
vulnerable housing status [23]. However, the measure-
ment invariance of the SF-12 has never been investigated
in rural- versus urban-dwelling populations.
Many rural populations have higher rates of mortality,

disability, and chronic disease than urban-dwelling pop-
ulations [24]. Some factors contributing to this disparity
are structural, such as a low population density leading
to transportation issues and a lack of access to specialists
[24, 25]. Others have suggested that cultural differences,
such as an increased emphasis on self-reliance, may con-
tribute to health disparities as well, with the caveat that
rural populations are a heterogenous group that varies
along a continuum of acculturation [26, 27]. Indeed, the
idea that rural populations may be culturally distinct
from urban populations has a long history and empirical
support [28–30]. It is possible that cultural differences in
some rural populations may result in biased estimates of
health status when using the SF-12, for example, some
rural-dwelling individuals may place an emphasis on
self-reliance, which could contribute to systematic
underreporting of symptoms.
Establishing measurement invariance is a prerequisite

for test interpretation and group comparison [18]. As the
SF-12 is used not only for research, but to inform public
policy [8], failure to attend to these measurement issues
has potentially costly or harmful consequences. We will
investigate the measurement invariance of the SF-12’s
MCS and PCS subscales across sex and geographical prox-
imity to metropolitan areas. As evidence of differential
item functioning by sex has been demonstrated in previ-
ous studies [17], we hypothesize that the SF-12 will not
demonstrate metric invariance. However, in the previous
study, although the magnitude of loadings differed, the
number of factors and pattern of loadings was consistent
between men and women. As such we hypothesize that
the SF-12 will demonstrate configural invariance. Finally,
although there is evidence calling into question the
validity of the SF-12 with minority populations, its psycho-
metric properties in a rural-dwelling general population
have not been previously investigated, so there is no direct
evidence to support a directional hypothesis. For that rea-
son, a directional hypothesis is not made regarding the
measurement invariance for a rural versus urban-dwelling
dementia/cognitive concern caregiver population.

Research design and methods
Participants
The analyses in this study were conducted using archival
data collected at the Rural and Remote Memory Clinic
(RRMC) in Saskatoon, Saskatchewan. The clinic services a

Ursenbach et al. Health and Quality of Life Outcomes           (2020) 18:91 Page 3 of 10


predominantly rural patient population (with rurality de-
fined as living at least 100 km outside of the two major
urban centers in Saskatchewan). Patients are referred for
further investigation of memory or other cognitive or be-
havioral concerns. Participants in this study were a cohort
of individuals who accompanied patients to the clinic, typ-
ically family members, hereafter referred to as caregivers.
Exclusion criteria included inability to read and write in
English and mental or physical disability that precluded
completion of the questionnaires. They completed a ques-
tionnaire packet which included the SF-12 while the pa-
tient they accompanied was assessed. Further information
about data collection and other RRMC operations are de-
tailed elsewhere [31]. Research ethics approval was pro-
vided by the University of Saskatchewan Research Ethics
Board (REB BEH 03–1219).

Measures
The SF-12 has demonstrated evidence for reliability and
validity in numerous populations and settings [3, 9–12,
21, 32], however, some concerns have been raised about
the nature of the scoring algorithm and its validity with
minority groups [7, 8, 17, 19]. As part of the initial valid-
ation of the instrument, Ware et al. [3] re-analyzed many
cross sectional and longitudinal studies including the
SF-36 using only the SF-12 items, and successfully
reproduced the same pattern of results using the shorter
survey. Some studies which included only the SF-12
items have been conducted in clinical and general popu-
lations and have demonstrated various forms of validity
evidence such as known group [9, 11], convergent [9, 12,
21] and discriminant validity [32]. Test-retest reliability
estimates have been reported at one week (PCS = .79,
MCS = .79) [11] and two weeks (PCS = .86–.89, MCS =
.76–.77) [3]. Point estimates for internal consistency
have been reported that range from .80 to .87 for the
PCS and .74 to .82 for the MCS [10, 21, 32]. Validity evi-
dence has been presented for the SF-12 with a variety of
populations, including a general population in the US [3,
17], Australia [7], Canada, Bermuda, New Zealand, and
various European nations [8], among individuals with se-
vere mental illness [11], patients in primary care [9],
with stroke [21], Parkinson’s Disease [13], with post-
partum women [19], among older Canadian Mennonites
[10], and homeless/vulnerably housed Canadians [23].
Participants’ degree of rurality was quantified by using

the Metropolitan Influence Zone (MIZ) that corresponded
to their area of residence as reported by Statistics Canada.
Geographical areas in Canada outside metropolitan areas
are divided into different levels of MIZ according to the
proportion of the employed workforce that commute into
metropolitan areas as opposed to working locally [33]. For
the present analyses, participants were divided into two
groups: Low- to weak-MIZ (less than 5% of employed

workforce commute) compared with moderate-MIZ to
urban (greater than or equal to 5% commute or live in
urban centers). The cut point was chosen to facilitate
comparison with other studies using this population [34].

Statistical procedure
The analyses were conducted using R version 3.4.2 [35].
Missing data were assessed using Little’s MCAR test to
determine whether the missing data were missing com-
pletely at random (i.e., missing independently of other
variables, both observed and unobserved) [36]. In the
event that Little’s MCAR test was significant, dummy
variables were created coding missingness for each ob-
served variable to determine if data were missing at ran-
dom (MAR) or missing not at random (MNAR). These
dummy variables were then tested for independence
from the remaining observed variables (sex and MIZ)
using separate chi square tests and evaluated for signifi-
cance at a Bonferroni-adjusted p-critical value of .002,
where a significant result indicates that data for that
item are conditional on other observed variables (MAR),
whereas the absence of significant results for that item
suggest that the missing data are conditional on an un-
observed variable (MNAR). Although no attempt to im-
pute missing values was planned, the nature of the
missing data has important implications for the results
that are taken up in the discussion.
Data were visually inspected for univariate normality

using quantile-quantile plots. Then skewness and kur-
tosis statistics were calculated, divided by the standard
error of the estimate, and evaluated against a critical z
value of 1.96. Multivariate normality was assessed using
Mardia’s Test of Multivariate Skewness and Kurtosis. In
the event of nonnormally distributed data, mean- and
variance-corrected weighted least squares (WLSMV) es-
timation was planned to account for the violated as-
sumption where appropriate in the remaining analyses.
Descriptive statistics were reported using independent

samples t-tests, Pearson’s correlations, and Fisher’s Z-
Tests where appropriate for between-group compari-
sons. The measurement model of the SF-12 was then de-
termined in this population by using CFA to compare
the fit of the hypothesized model to empirically sup-
ported SF-12 models in other populations. Specifically,
the model described by Ware et al. [3] was compared to
others [7, 17]. Model fit was assessed based on the
model fit criteria recommended by Hu and Bentler [37],
with a comparative fit index (CFI) > .95, and a root mean
squared error of approximation (RMSEA) < .06. The au-
thors note that multiple fit indices should be considered
when determining if fit is adequate and note that the
RMSEA tends to be overly conservative in smaller sam-
ple sizes. The robust fit indices described above have
demonstrated adequate capacity to detect model

Ursenbach et al. Health and Quality of Life Outcomes           (2020) 18:91 Page 4 of 10


misspecifications in simulation studies of nonnormally
distributed data when evaluated using Hu and Bentler’s
criteria [38]. If none of the pre-specified models fit well,
the modification indices were consulted and the param-
eter most contributing to poor fit was iteratively freed
and model fit re-assessed until adequate. Based on the
measurement model, Cronbach’s alpha coefficients and
confidence intervals were estimated for the PCS and
MCS within groups by sex and MIZ.
Once the measurement model was established, two

analyses of measurement invariance were conducted
based on sex and MIZ. In both cases, a grouping variable
was coded with the demographic difference. First, con-
figural invariance was assessed by fitting a multigroup
CFA based on the measurement model previously deter-
mined in which item means, loadings, intercepts, and re-
siduals were estimated freely. Adequate model fit
according to the Hu and Bentler [37] guidelines provides
evidence that the scale supports configural invariance.
Subsequent forms of invariance were evaluated by com-
paring the change in CFI from the less constrained to
more constrained model, where a significant deterior-
ation in model fit is indicated by change in RMSEA >
.01 and/or change in CFI < −.004 [39, 40]. Metric invari-
ance was first assessed by constraining item slopes to be
equal between groups and comparing the change in CFI
from the configural to metric model. If metric invariance
was supported, scalar invariance was then assessed by
also constraining item intercepts to be equal and com-
paring the change in CFI from the metric to scalar
model. Similarly, if scalar invariance was supported,
strict invariance was also assessed by constraining item
residual variances to be equal and comparing change in
CFI from scalar to strict model. If invariance was not
supported at any level, constraints were iteratively re-
leased based on the modification indices to determine
partial invariance [18].

Results
Of the 544 participants in the initial sample, 21.1% were
missing data regarding either their MIZ or one or more
items on the SF-12, resulting in a final sample size of
429. The most common pattern of missingness was par-
ticipants who omitted all items on the SF-12, accounting
for 16.7% of the missing data. Little’s MCAR Test was
significant, χ2(244) = 333.18, p < .001, indicating data
were not MCAR, suggesting that missing data were con-
ditional on another variable. None of the follow-up chi
square tests were significant at the Bonferroni adjusted
p-critical value when testing the independence of each
item’s missingness from Sex and MIZ, suggesting that
the missing data were conditional on an unobserved
variable, or MNAR.

Univariate normality was assessed visually and statisti-
cally. All SF-12 items showed significant univariate skew
(p < .05), and all but items 1, 2, 5, 8, 11 were significantly
kurtotic (p < .05). Mardia’s test of multivariate skewness
and kurtosis was significant for both skewness (b = 30.03,
z = 2147.26, p < .001) and kurtosis (b = 297.50, z = 19.49,
p < .001). As the data were not normally distributed
WLSMV estimation was used.
Descriptive statistics of the sample are reported in

Table 1. The average age of participants was 70.9 years
(SD = 0.5). Most participants were female, and about half of
the sample resided in a low to weak MIZ area. Across the
full sample, participants’ mean raw score on the SF-12 was
44.5 (SD = 7.4). Women (M = 45.4, SD = 7.0) scored signifi-
cantly higher than men (M = 43.2, SD = 7.8), t (364.2) = −
3.04, p = .003. Participants scores from a no- to low-MIZ
area (M = 44.3, SD = 7.7) did not significantly differ from
those from a moderate-MIZ to metropolitan area (M =
44.6, SD = 7.2), t (416.9) = 0.38, p = .708. Within groups esti-
mates of internal consistency were calculated with 95%
confidence intervals for the SF-12 overall, and then for
MCS and PCS separately. In both between group compari-
sons, the Cronbach’s alpha confidence intervals overlapped,
suggesting that internal consistency did not significantly
differ by sex or MIZ. In addition, all estimates exceeded
0.70, providing evidence of adequate internal consistency.
The estimates were all significant moderate negative corre-
lations, which ranged from −.56 for women to −.57 for
men, and from −.57 for no- to low-MIZ to −.61 for moder-
ate MIZ to urban-dwellers. Fisher’s Z test was not signifi-
cant for sex, Z = 0.14, p > .05, or MIZ, Z = 0.63, p > .05,
indicating that the correlations did not significantly differ in
either comparison.
To establish a baseline measurement model, several in-

creasingly complex models were compared, as shown in
Table 2. Model 1, described by Ware et al. [3] but with
correlated factors, did not provide an adequate fit to the
data. Similarly, Model 2, reported by Flieshman and col-
leagues [17] also did not fit the data well. Model 3, de-
scribed by Tucker and colleagues [8] also failed to
produce an adequate fit. A model specification search re-
quired two iterative consultations of the modification indi-
ces and freeing of parameters to produce an adequately
fitting model. The final model used for the measurement
invariance analysis consisted of correlated physical and
mental health factors, and items 1,2,3,4,5,8,10,12 on the
physical health subscale and items 1,4,5,6,7,9,10,11,12 on
the mental health subscale, with residual covariances for
item pairs 5–6, 7–8, 9–10, 12–13, 12–14.
The SF-12 demonstrated partial scalar invariance with

regard to sex as indicated in Table 3. All relevant model
parameters were invariant to sex with two exceptions.
The latent intercepts for items 10 and 11 varied by sex,
shown in Table 4. The sample means for item 11 (feeling

Ursenbach et al. Health and Quality of Life Outcomes           (2020) 18:91 Page 5 of 10


depressed) were 4.18 for females and 3.90 for males, for
an observed mean difference of 0.28, with higher num-
bers suggestive of greater depressive symptomatology.
The latent intercept estimates for that item were 4.011
for males and 4.178 for females, a difference of 0.167.
These results indicate that of the 0.28 observed mean
difference, 0.167 is due to the difference in intercept,
suggesting that the minor difference in male and female
responses on the item is partly due to influences other
than the mental and physical health factors modelled
here. In contrast, while observed group means for item
10 (feeling energetic) were similar, with 3.46 for males
and 3.45 for females, the estimated latent intercepts dif-
fered, with 3.628 for males and 3.453 for females, a dif-
ference of 0.175, suggesting that despite similar observed
means, female scores were associated with slightly lower
factor scores relative to male scores due to influences
outside the factors modeled here.
The SF-12 demonstrated partial scalar invariance

with respect to MIZ in this sample. Factor loadings
and latent intercepts were equal between groups ex-
cept for the intercept for item 2 (Current health
limits moderate activities), as indicated in Table 5.
On that item, participants from the Moderate-MIZ/
Urban group had an observed mean of 2.47, while
those from the No−/Weak-MIZ group had a mean of
2.53, for an observed mean difference of 0.06. Separ-
ate latent intercepts were estimated for that item for

each group, with the Moderate-MIZ/Urban group es-
timated at 2.426 and the No−/Weak-MIZ group at
2.532, for a difference of 0.106, suggesting that while
the observed mean item scores for the two groups
are very similar, No−/Weak-MIZ group scores were
associated with slightly higher physical health scores
for reasons not captured by the factor model.

Discussion
Our first hypothesis, that the SF-12 would demonstrate
only configural invariance, was not supported. Rather,
our results indicate that the SF-12 demonstrates partial
scalar invariance across sex, suggesting that it may not
be appropriate to compare the MCS and PCS scores of
males and females in a rural, secondary care dementia/
cognitive concern caregiver population. There is a differ-
ence in latent intercept estimates by sex of 0.17 on item
11 (feeling depressed, 5-point Likert) favoring females,
and a difference of 0.18 on item 10 (feeling energetic, 5-
point Likert) favoring males. These differences are of
similar magnitude but in opposite directions, and there-
fore may balance out. Although previous studies have
not examined the measurement invariance of the SF-12
across sex using this methodology, most studies of dif-
ferential item functioning have found no evidence of
practically significant differences in the way scale items
function for men compared with women. For example,
one study conducted with a Parkinson’s disease patient

Table 2 Comparison of model fit for baseline model

χ2 (df) p value RMSEA [95% CI] CFI TLI

Model 1: Ware et al. with correlated factors 311.71 (53) <.001 .107 [.095–.118]a .790a .739

Model 2: model 1 + cross-loading items 1, 10, 12 198.31 (50) <.001 .083 [.071–.096]a .880a .841

Model 3: model 2 + residual covariances for items from same SF-36 scale 161.93 (46) <.001 .077 [.064–.090]a .906a .865

Model 4: model 3 + cross-loading items 4, 5 and residual covariance for items 9, 10 83.24 (43) <.001 .047 [.031–.062] .967 .950

Model 1: items 1,2,3,4,5,8 on physical health factor and items 6,7,9,10,11,12 on mental health factor
All models estimated using WLSMV
aPoor model fit indicated by RMSEA > .05 and CFI < .95

Table 3 Measurement invariance tests regarding sex and Metropolitan Influence Zone

Model χ2 df RMSEA [95% CI] CFI

Sex 1. Configural 120.002 86 .043 [.022–.060] .971

2. Metric 118.349 101 .028 [.000–.047] .985

3. Scalar 139.229 111 .035 [.010–.051] .976*

3a. Partial Scalar (intercepts for items 10,11 free) 129.661 109 .030 [.000–.048] .982

4. Partial Strict (intercepts for items 10,11 free) 141.434 121 .028 [.000–.046] .983

MIZ 1. Configural 118.580 86 .042 [.021–.060] .973

2. Metric 122.457 101 .032 [.000–.050] .982

3. Scalar 138.409 111 .034 [.008–.051] .977*

3a. Partial Scalar (Intercepts for item 2 free) 135.244 110 .033 [.000–.050] .979

4. Partial Strict (Intercepts for item 2 free) 163.871 122 .040 [.022–.055]* .965*

*significant deterioration in model fit indicated by change in RMSEA > .01 and/or change in CFI < −.004 [40]

Ursenbach et al. Health and Quality of Life Outcomes           (2020) 18:91 Page 6 of 10


population used an item response theory analysis and re-
ported that items appeared to function similarly for men
and women and concluded that comparisons across sex
were appropriate [20]. Another study found evidence
that some items functioned differently for men and
women in a stroke population, but concluded that the
significant results were attributable to the large sample
size and did not reach the level of practical significance
[21]. As previously discussed, one study did find evi-
dence of significant, meaningful differential item func-
tioning by sex in a nationally representative sample. The
authors attributed this to a male tendency to avoid

responding in a way that indicates weakness or depend-
ence. They noted that this interpretation was supported
in their sample by men’s statistical reticence to endorse
items suggesting difficulty climbing stairs, lacking
energy, or feeling downhearted [17].
Regarding our second hypothesis, we did not specify a

priori whether we anticipated measurement invariance
on the SF-12 across MIZ. The SF-12 demonstrated par-
tial scalar invariance with respect to MIZ in this sample.
The intercept for item 2 (Current health limits moderate
activities) differed across groups by 0.11 favoring the No
−/Weak-MIZ group, suggesting their responses were

Table 4 Factor model parameter estimates from partial strict model across sexa

Factor Loadings

Item Physical Mental Unique Variances Latent Intercepts

1. General health 1.000b 0.204 0.453 3.430

2. Current health limits moderate activities 0.804 0.250 2.555

3. Climbing stairs 0.830 0.293 2.450

4. Accomplishing less 1.157 0.484 0.548 4.017

5. Health limits kinds of activities 1.446 0.298 0.427 4.167

6. Emotional problems accomplishing less 1.000b 0.397 4.305

7. Emotional problems being less careful 0.902 0.242 4.520

8. Pain interferes with work 1.342 0.478 4.218

9. Feeling calm 0.505 0.491 3.669

10. Feeling energetic 0.588 0.361 0.499 Male: 3.628
Female: 3.453

11. Feeling depressed 0.574 0.666 Male = 4.011
Female = 4.178

12. Social activities 0.528 0.579 0.449 4.468
aAll parameters reported in unstandardized form
bParameter fixed to 1 for identification

Table 5 Factor model parameter estimates from partial strict model across MIZa

Factor Loadings

Item Physical Mental Latent Intercepts

1. General health 1.000b 0.204 3.316

2. Current health limits moderate activities 0.774 No/Weak MIZ: 2.532
Moderate MIZ/Urban: 2.426

3. Climbing stairs 0.812 2.363

4. Accomplishing less 1.220 0.484 3.869

5. Health limits kinds of activities 1.521 0.298 4.001

6. Emotional problems accomplishing less 1.000b 4.231

7. Emotional problems being less careful 0.900 4.449

8. Pain interferes with work 1.284 4.083

9. Feeling calm 0.510 3.626

10. Feeling energetic 0.565 0.342 3.441

11. Feeling depressed 0.610 4.060

12. Social activities 0.546 0.549 4.361
aAll parameters reported in unstandardized form
bParameter fixed to 1 for identification

Ursenbach et al. Health and Quality of Life Outcomes           (2020) 18:91 Page 7 of 10


associated with slightly higher physical health factor
scores for reasons not captured by the factor model.
One other study has investigated the psychometric prop-
erties of the SF-12 in a rural setting, specifically among
older adult rural-dwelling Mennonites in Canada [10].
Although they did not examine invariance directly, they
found evidence of validity in that population. Specific-
ally, using an exploratory factor analysis, they found the
expected two-factor solution, and they found evidence of
known group validity on a range of groups such as age,
income, marital status, self-reported health, social inter-
action, and spirituality. These results are generally con-
sistent with the present study as they suggest that the
SF-12 may be validly used in some rural-dwelling popu-
lations. Consistent with our findings of limited invari-
ance across MIZ, a previous study using a subpopulation
of the same dementia/cognitive concern caregiver popu-
lation found evidence of only configural and weak in-
variance across MIZ on the Zarit Burden Inventory [34],
a measure of dementia caregiver burden. Taken to-
gether, these results suggest that factors related to MIZ
influence the measurement properties of psychometric
instruments. Consideration of participants’ geographic
residence and associated cultural influences is recom-
mended in future development and use of psychological
measures with such populations.
There are some limitations to this study beyond the

issue of estimation and the violation of the assumption
of multivariate normality previously discussed. Specific-
ally, approximately one in five participants were missing
data and could not be included. Subsequent analysis sug-
gested that the data were MNAR, or in other words,
conditional on an unobserved variable. Although it is
not clear from the data extracted from the archival data-
set why these data were missing, it is possible that those
participants who did not provide data did so because the
SF-12 was not a valid instrument for them, which would
limit the generalizability of these results. For example, it
is possible that a culturally distinct subpopulation of de-
mentia/cognitive concern caregivers felt disenfranchised
due to a history of negative experiences in Canadian so-
cial programs and therefore chose not to participate in
data collection. While multiple imputation is typically
recommended when data are MAR and even MNAR
[41], in this case there is a risk that doing so will obscure
systematic differences in the missing subpopulation yet
provide the illusion of methodological rigour. Caution is
urged in the generalization of these findings because in
addition to the high proportion of missing data, the tar-
get population is quite unique, specifically, it is com-
prised of largely rural-dwelling caregivers of people with
cognitive concerns referred to secondary care.
In future research it is important to replicate these

findings in other urban and rural populations, ideally

using nationally representative samples to minimize
sampling bias. Other studies of the SF-12 have provided
evidence of differential item functioning in various pop-
ulations. Future research should examine the functioning
of individual items in this dementia/cognitive concern
caregiver population across different demographic vari-
ables to ensure that group comparisons are not biased.
Finally, this study provided further evidence that the
physical and mental health subscales of the SF-12 are
correlated, suggesting that use of scoring coefficients
that assume an orthogonal relationship produces in-
accurate estimates of mental and physical health related
quality of life [6, 14–16].
In conclusion, the current study adds to existing litera-

ture about the SF-12 by demonstrating the inadequacy
of the measurement model proposed by Ware et al. [3]
in a rural-dwelling dementia/cognitive concern caregiver
population. It also providing evidence for partial scalar
measurement invariance of the SF-12 across sex and
MIZ, indicating that within this population, some cau-
tion should be used when comparing the physical and
mental health related quality of life between those
groups using the SF-12.

Implications
Foremost, these data suggest the commercially available
scoring program that models the mental health and
physical health quality of life as orthogonal is not the
best fit to the data. Although these results should be
replicated, our findings have implications for use of the
commercially available scoring program for the SF-12.
Participants’ sex and MIZ should be accounted for when
comparing their factor scores on the SF-12. Additionally,
consideration of geographic residence and associated
cultural influences is recommended in future develop-
ment and use of psychological measures with such
populations.

Abbreviations
SF-12: Short Form Health Survey; MIZ: Metropolitan influence zone; SF-36: 36-
Item Short Form Health Survey; PCS: Physical health summary component;
MCS: Mental health summary component; CFA: Confirmatory factor analysis;
MG-CFA: Multigroup confirmatory factor analysis; RRMC: Rural and Remote
Memory Clinic; REB BEH: Research Ethics Board; MAR: Missing at random;
MNAR: Missing not at random; WLSMV: Variance-corrected weighted least
squares; CFI: Comparative fit index; RMSEA: Root mean squared error of
approximation

Acknowledgements
We acknowledge the families and patients of the RRMC.
Authors’ information (optional): none.

Authors’ contributions
MEO, AK, and DM collected the data. JU and MEO conceived the project, JU
did the analyses wrote the manuscript under MEO’s supervision. JU, MEO,
AK, and DM edited the manuscript. The author(s) read and approved the
final manuscript.

Ursenbach et al. Health and Quality of Life Outcomes           (2020) 18:91 Page 8 of 10


Funding
JU received a Master’s Award from the Canadian Institutes for Health
Research. Further trainee funding was provided by the Canadian Consortium
on Neurodegeneration in Aging (CCNA). CCNA is supported by a grant from
the Canadian Institutes of Health Research with funding from several
partners including the Saskatchewan Health Research Foundation. The RRMC
is funded by a CIHR Foundation Grant to DM, by SK Health, and by in-kind
support from the University of Saskatchewan Department of Psychology.

Availability of data and materials
Upon request.

Ethics approval and consent to participate
Obtained at data collection.

Consent for publication
Obtained in consent.

Competing interests
The authors declare that they have no competing interests.

Author details
1Department of Psychology, University of Saskatchewan, 9 Campus Drive,
Saskatoon S7N 5A5, Saskatchewan, Canada. 2College of Medicine, University
of Saskatchewan, 9 Campus Drive, Saskatoon S7N 5A5, Saskatchewan,
Canada. 3Canadian Centre for Health and Safety in Agriculture, University of
Saskatchewan, 9 Campus Drive, Saskatoon S7N 5A5, Saskatchewan, Canada.

Received: 17 May 2019 Accepted: 9 March 2020

References
1. Alzheimer's Society of Canada. Prevalence and monetary costs of dementia

in Canada. 2016. Retrieved from http://alzheimer.ca/sites/default/files/files/
national/statistics/prevalenceandcostsofdementia_en.pdf.

2. Borgaonkar MR. Quality of life measurement in gastrointestinal and liver
disorders. Gut. 2000;47(3):444–54. https://doi.org/10.1136/gut.47.3.444.

3. Ware J, Kosinski M, Keller S. A 12-item short-form health survey: construction
of scales and preliminary tests of reliability and validity. Med Care. 1996;
34(3):220–33.

4. Hawthorne G, Osborne RH, Taylor A, et al. The SF-36 version 2: critical
analyses of weights, scoring algorithms and population norms. Qual Life
Res. 2007;16(661):73.

5. Harrell FE. Regression modeling strategies: With applications to linear
models, logistic regression, and survival analysis (Second edition). New York:
Springer; 2015.

6. Hagell P, Westergren A, Årestedt K. Beware of the origin of numbers:
standard scoring of the SF-12 and SF-36 summary measures distorts
measurement and score interpretations. Res Nurs Health. 2017;40(4):378–86.

7. Tucker G, Adams R, Wilson D. Observed agreement problems between sub-
scales and summary components of the SF-36 version 2: an alternative
scoring method can correct the problem. PLoS One. 2013;8(4):e61191.
https://doi.org/10.1371/journal.pone.0061191.

8. Tucker G, Adams R, Wilson D. The case for using country-specific scoring
coefficients for scoring the SF-12, with scoring implications for the SF-36.
Qual Life Res. 2016;25(2):267–74. https://doi.org/10.1007/s11136-015-1083-7.

9. Amir M, Lewin-Epstein N, Becker G, Buskila D. Psychometric properties of
the SF-12 (Hebrew version) in a primary care population in Israel. Med Care.
2002;40(10):918–28.

10. Fisher K, Newbold KB. Validity of the SF-12 in a Canadian old order
Mennonite community. Appl Res Qual Life. 2014;9(2):429–48. https://doi.org/
10.1007/s11482-013-9241-y.

11. Salyers MP, Bosworth HB, Swanson JW, Lamb-Pagone J, Osher FC. Reliability
and validity of the SF-12 health survey among people with severe mental
illness. Med Care. 2000;38(11):1141–50.

12. Bentur N, King Y. The challenge of validating SF-12 for its use with
community-dwelling elderly in Israel. Qual Life Res. 2010;19(1):91–5. https://
doi.org/10.1007/s11136-009-9562-3.

13. Jakobsson U, Westergren A, Lindskov S, Hagell P. Construct validity of the
SF-12 in three different samples. J Eval Clin Pract. 2012;18(3):560–6. https://
doi.org/10.1111/j.1365-2753.2010.01623.x.

14. Johnson JA, Maddigan SL. Performance of the RAND-12 and SF-12 summary
scores in type 2 diabetes. Qual Life Res. 2004;13(2):449–56. https://doi.org/
10.1023/B:QURE.0000018494.72748.cf.

15. Niles AN, Sherbourne CD, Roy-Byrne PP, Stein MB, Sullivan G, Bystritsky A,
Craske MG. Anxiety treatment improves physical functioning with oblique
scoring of the SF-12 short form health survey. Gen Hosp Psychiatry. 2013;
35(3):291–6. https://doi.org/10.1016/j.genhosppsych.2012.12.004.

16. Windsor TD, Rodgers B, Butterworth P, Anstey KJ, Jorm AF. Measuring
physical and mental health using the SF-12: implications for community
surveys of mental health. Aust N Z J Psychiatry. 2006;40(9):797–803.

17. Fleishman J, Lawrence W. Demographic variation in SF-12 scores: True
differences or differential item functioning? Medical Care. 2003;41(7).
Retrieved from http://www.jstor.org/stable/3767691.

18. Millsap R. Statistical approaches to measurement invariance. New York:
Routledge; 2011.

19. Desouky TF, Mora PA, Howell EA. Measurement invariance of the SF-12
across European-American, Latina, and African-American postpartum
women. Qual Life Res. 2013;22(5):1135–44. https://doi.org/10.1007/s11136-
012-0232-5.

20. Hagell P, Westergren A. Measurement properties of the SF-12 health survey
in Parkinson’s disease. J Park Dis. 2011;1(2):185–96.

21. Okonkwo OC, Roth DL, Pulley L, Howard G. Confirmatory factor analysis of
the validity of the SF-12 for persons with and without a history of stroke.
Qual Life Res. 2010;19(9):1323–31. https://doi.org/10.1007/s11136-010-9691-
8.

22. Vandenberg RJ, Lance CE. A review and synthesis of the measurement
invariance literature: suggestions, practices, and recommendations for
organizational research. Organ Res Methods. 2000;3(1):4–70. https://doi.org/
10.1177/109442810031002.

23. Gadermann AM, Sawatzky R, Palepu A, Hubley AM, Zumbo BD, Aubry T,
et al. Minimal impact of response shift for SF-12 mental and physical health
status in homeless and vulnerably housed individuals: an item-level multi-
group analysis. Qual Life Res. 2017;26(6):1463–72. https://doi.org/10.1007/
s11136-016-1464-6.

24. Jones CA, Parker TS, Ahearn M, Mishra AK, Variyam JN. Health status and
health care access of farm and rural populations. U.S. Department of
Agriculture, Economics Research Services. 2009. Retrieved from https://
www.ers.usda.gov/webdocs/publications/44424/9370_eib57_
reportsummary_1_.pdf.

25. Arcury TA, Preisser JS, Gesler WM, Powers JM. Access to transportation and
health care utilization in a rural region. J Rural Health. 2005;21(1):31–8.

26. Hartley D. Rural health disparities, population health, and rural culture. Am J
Public Health. 2004;94(10):1675–8.

27. Wagonfeld MO. A snapshot of rural and frontier America. In: Stamm BH,
editor. Rural behavioral health care: An interdisciplinary guide. Washington,
DC: American Psychological Association; 2003.

28. Fischer C. Toward a subcultural theory of urbanism. Am J Sociol. 1975;80(6):
1319–41.

29. Tittle CR, Grasmick HG. Urbanity: influences of urbanness, structure, and
culture. Soc Sci Res. 2001;30(2):313–35.

30. Wirth L. Urbanism as a way of life. In R. Sennett (Ed.), Classic essays on the
culture of cities (1969; pp. 143–164). New York: Appleton–Century–Crofts;
1938.

31. Morgan DG, Crossley M, Kirk A, D’Arcy C, Stewart N, Biem J, et al. Improving
access to dementia care: development and evaluation of a rural and remote
memory clinic. Aging Ment Health. 2009;13(1):17–30. https://doi.org/10.
1080/13607860802154432.

32. Larson CO, Schlundt D, Patel K, Beard K, Hargreaves M. Validity of the SF-12
for use in a low-income African American community-based research
initiative (REACH 2010). 2008;5(2):14.

33. Statistics Canada. Census metropolitan influenced zone (MIZ). 2015.
Retrieved from http://www12.statcan.gc.ca/census-recensement/2011/ref/
dict/geo010-eng.cfm.

34. Branger C, O’Connell ME, Morgan DG. Factor analysis of the 12-item Zarit
burden interview in caregivers of persons diagnosed with dementia. J Appl
Gerontol. 2016;35(5):489–507. https://doi.org/10.1177/0733464813520222.

35. R Core Team. R: a language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. 2017. URL https://
www.R-project.org/.

36. Little RJA. A test of missing completely at random for multivariate data with
missing values. J Am Stat Assoc. 1988;83(404):1198.

Ursenbach et al. Health and Quality of Life Outcomes           (2020) 18:91 Page 9 of 10

http://alzheimer.ca/sites/default/files/files/national/statistics/prevalenceandcostsofdementia_en.pdf
http://alzheimer.ca/sites/default/files/files/national/statistics/prevalenceandcostsofdementia_en.pdf
https://doi.org/10.1136/gut.47.3.444
https://doi.org/10.1371/journal.pone.0061191
https://doi.org/10.1007/s11136-015-1083-7
https://doi.org/10.1007/s11482-013-9241-y
https://doi.org/10.1007/s11482-013-9241-y
https://doi.org/10.1007/s11136-009-9562-3
https://doi.org/10.1007/s11136-009-9562-3
https://doi.org/10.1111/j.1365-2753.2010.01623.x
https://doi.org/10.1111/j.1365-2753.2010.01623.x
https://doi.org/10.1023/B:QURE.0000018494.72748.cf
https://doi.org/10.1023/B:QURE.0000018494.72748.cf
https://doi.org/10.1016/j.genhosppsych.2012.12.004
http://www.jstor.org/stable/3767691
https://doi.org/10.1007/s11136-012-0232-5
https://doi.org/10.1007/s11136-012-0232-5
https://doi.org/10.1007/s11136-010-9691-8
https://doi.org/10.1007/s11136-010-9691-8
https://doi.org/10.1177/109442810031002
https://doi.org/10.1177/109442810031002
https://doi.org/10.1007/s11136-016-1464-6
https://doi.org/10.1007/s11136-016-1464-6
https://www.ers.usda.gov/webdocs/publications/44424/9370_eib57_reportsummary_1_.pdf
https://www.ers.usda.gov/webdocs/publications/44424/9370_eib57_reportsummary_1_.pdf
https://www.ers.usda.gov/webdocs/publications/44424/9370_eib57_reportsummary_1_.pdf
https://doi.org/10.1080/13607860802154432
https://doi.org/10.1080/13607860802154432
http://www12.statcan.gc.ca/census-recensement/2011/ref/dict/geo010-eng.cfm
http://www12.statcan.gc.ca/census-recensement/2011/ref/dict/geo010-eng.cfm
https://doi.org/10.1177/0733464813520222
https://www.R-project.org/
https://www.R-project.org/


37. Hu LT, Bentler PM. Cutoff criteria for fit indexes in covariance structure
analysis: conventional criteria versus new alternatives. Struct Equ Model
Multidiscip J. 1999;6(1):1–55. https://doi.org/10.1080/10705519909540118.

38. Brosseau-Liard PE, Savalei V. Adjusting incremental fit indices for
nonnormality. Multivar Behav Res. 2014;49(5):460–70. https://doi.org/10.
1080/00273171.2014.933697.

39. Cheung GW, Rensvold RB. Evaluating goodness-of-fit indexes for testing
measurement invariance. Struct Equ Model Multidiscip J. 2002;9(2):233–55.
https://doi.org/10.1207/S15328007SEM0902_5.

40. Rutkowski L, Svetina D. Measurement invariance in international surveys
Categorical indicators and fit measure performance. Appl Meas Educ. 2017;
30(1):39–51.

41. Baguley T, Andrews M. Handling missing data. In: Robertson J, Kaptein M,
editors. Modern statistical methods for HCI. Cham: Springer International
Publishing; 2016. p. 57–82. https://doi.org/10.1007/978-3-319-26633-6_4.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

Ursenbach et al. Health and Quality of Life Outcomes           (2020) 18:91 Page 10 of 10

https://doi.org/10.1080/10705519909540118
https://doi.org/10.1080/00273171.2014.933697
https://doi.org/10.1080/00273171.2014.933697
https://doi.org/10.1207/S15328007SEM0902_5
https://doi.org/10.1007/978-3-319-26633-6_4

	Abstract
	Background and objectives
	Research design and methods
	Results
	Discussion
	Implications

	Background and objectives
	Research design and methods
	Participants
	Measures
	Statistical procedure

	Results
	Discussion
	Implications
	Abbreviations

	Acknowledgements
	Authors’ contributions
	Funding
	Availability of data and materials
	Ethics approval and consent to participate
	Consent for publication
	Competing interests
	Author details
	References
	Publisher’s Note