A strength-based mirror effect persists even when criterion shifts are unlikely

Gregory J. Koop (Eastern Mennonite University), Amy H. Criss (Syracuse University), & Angelina M. Pardini (Eastern Mennonite University)

Published online: 21 February 2019
© The Psychonomic Society, Inc. 2019

Abstract

In single-item recognition, the strength-based mirror effect (SBME) is reliably obtained when encoding strength is manipulated between lists or participants. Debate surrounds the degree to which this effect is due to differentiation (e.g., Criss, Journal of Memory and Language, 55, 461–478, 2006) or criterion shifts (e.g., Hicks & Starns, Memory & Cognition, 42, 742–754, 2014). Problematically, differing underlying control processes may be equally capable of producing an SBME. The ability of criterion shifts to produce an SBME has been shown in prior work where differentiation was unlikely. The present work likewise produces an SBME under conditions where criterion shifts are unlikely. Specifically, we demonstrate that an SBME can be elicited without the typical number of trials needed to adjust one's decision criterion (Experiments 1, 2, and 5) and using encoding manipulations that do not explicitly alert participants that their memory quality has changed (Experiments 3 and 4). When taken in the context of the broader literature, these results demonstrate the need to prioritize memory models that can predict SBMEs via multiple underlying processes.

Keywords: Recognition · Strength-based mirror effect · Differentiation · Criterion shifts

Imagine you have just started teaching at a new university when a friend comes to visit and requests a tour of campus. During this tour, a group of young adults passes by, and your friend asks you if you have any of them in class. Feeling somewhat sheepish, you point to a couple of students that look sort of familiar and identify them as being in your class. (Fortunately, your friend has no way of knowing if you're wrong.) After only a week of classes, it remains exceedingly difficult to distinguish between your own students and the other students that look similar. As luck would have it, your friend again passes through town 8 months later, and you again are walking around campus when she asks you to identify any of your students. Having spent an academic year on campus, you quickly and confidently identify only those students that you actually had in class. Just as importantly, you also note that you are much less tempted to misidentify the other students in that group that you did not have in class. It strikes you as odd that you can so easily dismiss these other students because, after all, you have had roughly the same amount of experience with them after a year of classes as you did after a week of class (i.e., none). Why now, after a year has passed, has it become easier to dismiss these unknown students?

This opening example is analogous to a phenomenon in recognition-memory research known as the strength-based mirror effect (SBME; Glanzer & Adams, 1990; Glanzer, Adams, Iverson, & Kim, 1993; Stretch & Wixted, 1998). In a typical SBME task, participants study a series of strong words or weak words, and then complete a single-item recognition test over that material (see Fig. 1 for a version of this design used in Experiment 1).
At test, participants are generally presented with equal numbers of studied items (targets) and unstudied items (foils) and asked to identify each as "old" or "new." The SBME describes the finding that strengthening items at study improves performance in two ways. One's ability to correctly identify previously studied items (the hit rate, or HR) improves, whereas the likelihood of incorrectly recognizing unstudied material (the false-alarm rate, or FAR) decreases. The question posed at the end of the opening example is also one that has persisted in the literature: Why do unstudied foils become easier to reject as the contents of memory are strengthened? Why should foils benefit from an encoding manipulation for which they, by definition, were not present?

Significance of the strength-based mirror effect to memory theory

While all contemporary models of memory predict that strengthening items (whether through repetition, duration, or "depth") leads to a higher HR, the accompanying decrease in FAR has been somewhat more contentious. One reason for this debate stems from the use of signal detection theory (SDT; Green & Swets, 1966; Macmillan & Creelman, 2004) to analyze recognition performance. Signal detection theory is a measurement model that describes performance in a recognition-memory task along two spectra: discriminability and bias. Discriminability represents the ease with which targets can be distinguished from foils. Targets, by virtue of being studied, generate more mnemonic evidence than foils during the recognition test. As items are strengthened at study, targets acquire increasing amounts of mnemonic evidence, and the difference in evidence between targets and foils increases (thereby increasing discriminability). Applications of SDT to memory assume that the mnemonic evidence for foils is unaffected by study. Bias describes one's general tendency toward giving "old" or "new" responses. In other words, bias reflects the amount of mnemonic evidence an individual requires to call an item "old." This threshold for calling an item "old" is known as a decision criterion. For example, an individual who is more "old" biased requires less evidence to call an item "old" and therefore adopts a more liberal criterion.
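Stated concretely, both quantities have standard closed forms in the equal-variance formulation (a textbook statement included here for reference; these formulas are not taken from the article itself):

```latex
\[
d' = z(\mathrm{HR}) - z(\mathrm{FAR}), \qquad
c = -\tfrac{1}{2}\big[\, z(\mathrm{HR}) + z(\mathrm{FAR}) \,\big]
\]
```

Here z is the inverse of the standard normal cumulative distribution. Larger d′ indicates better discriminability, and larger (more positive) c indicates a more conservative criterion, that is, fewer "old" responses overall.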
Returning to the SBME, there exist two accounts for the decrease in FAR for strong items relative to weak items. The traditional explanation, here called the decisional account, assumes that changes in the FAR between strong lists and weak lists are due to metacognitive decisional processes—that is, criterion shifts (Benjamin & Bawa, 2004; Hirshman, 1995; Morrell, Gaitan, & Wixted, 2002; Stretch & Wixted, 1998). Some assume the criterion is set on the basis of the study list (e.g., Hirshman, 1995), and others assume the criterion is set on the basis of the test list (e.g., Verde & Rotello, 2007). In a standard SBME paradigm these hypotheses are indistinguishable because the strength of the targets is the same at study and test. Participants in a strong list condition ostensibly realize the quality of their memory is high and therefore require more mnemonic evidence to call an item "old" (Starns, Ratcliff, & White, 2012; Starns, White, & Ratcliff, 2010; Stretch & Wixted, 1998; Verde & Rotello, 2007). This more conservative criterion results in a lower FAR because, according to the decisional account, mnemonic evidence for foils is determined only by preexperimental familiarity (Hirshman, 1995; Parks, 1966; Stretch & Wixted, 1998). However, the assumption that the mnemonic evidence generated by foils does not change between strong and weak lists has been contested.

We will call this second explanation the mnemonic account. In short, the mnemonic account claims that the match between a test item and the contents of episodic memory produces mnemonic evidence that necessarily depends on the fidelity of the events in episodic memory (e.g., Criss, 2006, 2009, 2010; Shiffrin & Steyvers, 1997). Relative to weak lists, the contents of episodic memory traces are more complete in strong lists. As the contents of memory become more complete, any item presented at test will be less likely to match by chance alone. In other words, foils match the contents of the episodic memory traces less well in strong lists and therefore functionally produce less mnemonic evidence. That is, they produce more evidence of not being in episodic memory. This process, known as differentiation, is a fundamental characteristic of episodic memory (see Criss & Koop, 2015, for a review; McClelland & Chappell, 1998).

Although these two accounts specify very different mechanisms for the SBME, it is difficult to discriminate between them using the standard pure-strength SBME design (Starns et al., 2012; Starns et al., 2010). One strategy has been to look for mirror effects under conditions where differentiation should not occur (e.g., Hicks & Starns, 2014; Starns & Olchowski, 2015; Starns et al., 2012; Starns et al., 2010). Critically, this body of work shows that criterion shifts alone can be sufficient to produce an SBME. This literature does not ask whether differentiation alone could also produce an SBME. This is the primary question engaged in this article. To address this question, we first review the literature on when criterion shifts occur (and when they do not).

Under what conditions do criterion shifts occur?

The standard SBME design is a pure-strength, single-item recognition task. That is, a given study–test cycle will include only strong items or weak items, but never both. Under these conditions, a mirror effect can be reliably produced (e.g., Criss, 2006, 2009, 2010; Glanzer & Adams, 1990; Hirshman, 1995; Koop & Criss, 2016; Stretch & Wixted, 1998; among others). However, decisional and mnemonic accounts are confounded in such a design (Starns et al., 2012; Starns et al., 2010). In contrast to pure-strength experiments, a mixed-strength experiment presents strong and weak items within the same study–test cycle (see Fig. 2 for a version of this design used in Experiment 3 of this article) and assumes that a criterion is set on the basis of the test. At first glance, such a mixed-strength design would seem to distinguish between decisional and mnemonic accounts because the overall degree of differentiation would be consistent for all test items.[1] Unfortunately, there is an obvious problem with this logic—foils are not classified as strong or weak within the experimental design.
A foil is by definition unstudied and not directly affected by any encoding manipulations. Stretch and Wixted (1998) addressed this by explicitly cuing anticipated strength. At test, each item was presented in one of two colors. If an item was red (for example), that indicated the item was either a strongly studied target or a foil. If an item was blue, that indicated the item was either a weakly studied target or a foil. The straightforward prediction, then, was that participants should adopt a more stringent criterion for red (strong) foils than for blue (weak) foils. Although the HR was greater for strong items than for weak items, there was no difference in FAR, even when participants were explicitly alerted to the meaning of the color-cuing manipulation. Rather than item color, Morrell et al. (2002) differentially strengthened one of two semantic categories at study. However, participants again failed to show changes to FAR as a function of category strength, even when explicitly told that one category would be strengthened. This led Morrell and colleagues to conclude that although it is possible for participants to shift criteria on an item-by-item basis, "they appear to be remarkably reluctant to do so even when they know they should, and it would be easy for them to do were they so inclined" (p. 1107). Many other studies have also failed to indicate within-list strength-based criterion shifts (e.g., Bruno, Higham, & Perfect, 2009, Experiment 1; Higham, Perfect, & Bruno, 2009, Experiment 2; Verde & Rotello, 2007).

Eliciting criterion shifts within mixed-strength lists is a fickle endeavor, but there have been a few demonstrations that participants can flexibly adjust their criterion. The literature suggests two characteristics that increase the likelihood of criterion shifts. The first is that participants are explicitly provided with clearly differing strength expectations. The second is that participants have substantial time (or, more accurately, trials) to adjust the criterion when changes in the testing environment (and strength expectations) are not made explicit.

A study by Hicks and Starns (2014) demonstrates both of these principles. They had participants study a mixed-strength list. At test, strength was cued by color coding and by instruction. Participants were informed that items in red font (for example) should be judged as studied once or as not studied, and items presented in green font should be judged as having been studied four times or not studied. Following a mixed-strength study list, participants completed an 80-item test where item strength was randomly intermixed or like-strength items were grouped into blocks of varying size (40, 20, or 10 items). When like-strength items were blocked and participants were clearly alerted to differing strength expectations via instructions and color cues, strength-based criterion shifts (as measured by changes in FAR) were elicited. When items were color cued but presented randomly, criterion shifts were not consistently produced (only one of three experiments showed the effect). When the like-strength blocks were provided but color cuing (and corresponding instructions) was withheld, false alarms did not differ as a function of the strength of the targets in the test block. However, when blocks were 40 items in length, participants that began with a weak block showed a significantly higher FAR than those that began with a strong block (Experiment 1).
This finding led the authors to conclude that "participants do not stabilize their criterion in the first 10 or 20 trials, but getting a consistent and high expectation of strength for 40 trials produces a criterion shift" (Hicks & Starns, 2014, p. 751).

Subsequent work has demonstrated an alternative, but conceptually related, means to elicit strength-based criterion shifts. Differences in FAR can be produced if participants are required to use unique responses to expected-strong and expected-weak items at test (Franks & Hicks, 2016; Starns & Olchowski, 2015). Thus, we can conclude that criterion placement requires substantial affordances: manipulations that make participants clearly aware of differences between strong and weak items (Franks & Hicks, 2016; Hicks & Starns, 2014; Starns & Olchowski, 2015), and/or numerous like-strength trials (more than 20; Hicks & Starns, 2014). These findings fit with the general criterion shift narrative that during a typical SBME paradigm, participants set the criterion on the basis of expected test strength, which differs between pure-weak lists and pure-strong lists. Evidence from alternative bias manipulations like base rate (Estes & Maddox, 1995; Koop & Criss, 2016; Rhodes & Jacoby, 2007) or distractor similarity (Benjamin & Bawa, 2004; Brown, Steyvers, & Hemmer, 2007) supports the notion that as changes in the testing environment become more apparent to the participant, criterion shifts become increasingly likely. By providing individuals with abundant affordances (e.g., very explicit cues, pure-strength blocking of significant duration), experimenters can usually elicit changes in FARs between weak and strong items when differentiation should be minimized.

Why are pure-strength strength-based mirror effects so reliable?

Given extensive explicit affordances, individuals shift the criterion somewhat reliably. However, the affordances needed to produce within-list criterion shifts are a clear departure from the ease with which pure-list SBMEs have been consistently documented over decades of research (e.g., Criss, 2006, 2009, 2010; Glanzer & Adams, 1990; Hirshman, 1995; Koop & Criss, 2016; Stretch & Wixted, 1998). One possibility is that pure-strength lists produce an SBME so reliably because they meet the two criteria for criterion shifts that were discussed above: a sufficient number of trials to establish stable strength expectations, and awareness of expected memory quality at test (see Hicks & Starns, 2014, for a similar explanation). The first question addressed by the experiments presented here is whether an SBME is still produced in a pure-strength design when the number of trials falls below the number of trials required for establishing a criterion. If an SBME is still evident, it would suggest that criterion shifts are not necessary to produce a mirror effect in a pure-strength single-item recognition study.

[1] Recall that differentiation occurs because test items are compared with the contents of memory acquired during study (see Shiffrin & Steyvers, 1997). In pure-strength experiments, test items are compared with either an entirely strong list (producing poorer matches) or an entirely weak list (producing better matches). Consequently, false-alarm rates will be lower for strong lists than for weak lists. In a mixed-strength design, both strong and weak items are compared back to the same (mixed-strength) contents of memory.
Another explanation for the consistency of the pure-strength SBME is that criterion adjustment is not necessary in a pure-strength design. After all, the 40-trial threshold for criterion setting comes from an experiment where the strength of the study list (mixed) and test list (pure) differed. Perhaps participants are much more efficient at setting criteria in pure-strength lists because expectations about memory strength do not change. This account is highly unlikely. Recent data have indicated that participants bring established expectations about memory performance into the experimental setting (Cox & Dobbins, 2011; Koop, Criss, & Malmberg, 2015; Turner, Van Zandt, & Brown, 2011). For example, HRs and FARs are shockingly similar between groups of participants presented with test lists consisting entirely of targets or foils and individuals in standard test lists consisting of half foils and half targets (Cox & Dobbins, 2011; Koop et al., 2015). Obviously, participants faced with a test list consisting entirely of targets should have a different understanding of the strength of items at test in an all-target list than in an all-foil list. In fact, participants only dramatically altered their responses when they were provided with feedback indicating that preexperimental expectations no longer held (Koop et al., 2015). These data demonstrated that individuals do not come into the test setting as "blank slates." Participants have a lifetime of experience making recognition decisions and have therefore developed an understanding about what is likely to be an accurate memory and what is not (Wixted & Gaitan, 2002). It is reasonable to assume these "preexperimental priors" will be maintained unless it becomes apparent to the participant that they no longer hold (Turner et al., 2011). Thus, for participants to adjust their criterion in a typical SBME design, they must have an accurate understanding about the effects of different encoding manipulations. This claim will be the focus of Experiments 3 and 4.

To summarize, strength-based criterion shifts require two things: sufficient trials for establishing expectations about memory quality, and/or manipulations that make participants acutely aware of differences between strong and weak trials. Experiments 1 and 2 explored whether SBMEs persist with an insufficient number of trials to establish a criterion, while Experiments 3 and 4 examined participants' awareness of the effects of encoding manipulations. Finally, Experiment 5 combines both of these manipulations to assess memory and awareness of encoding manipulations in short study–test lists.

Experiments 1 and 2

In Experiments 1 and 2, we use lists short enough to eliminate criterion shifts. The above discussion focuses on the number of test trials necessary to establish a criterion, but, of course, study trials could also help participants establish an appropriate criterion (e.g., Hirshman, 1995). To eliminate either possibility, we use both short study and test lists. The experiments are very similar and differ only in that Experiment 2 is slightly more difficult by virtue of different encoding tasks and a slightly longer delay between study and test.

Method

Participants

Seventy-two introductory psychology students from Syracuse University participated in Experiment 1. Thirty-nine introductory psychology students from Eastern Mennonite University participated in Experiment 2. Participant data were excluded from analysis if the participant had a d′ of less than 0.5 on either of the two study–test cycles (described below). This exclusion criterion resulted in removing three participants from Experiment 1 and two participants from Experiment 2. All participants received partial fulfillment of course requirements in exchange for their participation.
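To make the exclusion rule concrete, the sketch below shows one conventional way to compute d′ from raw response counts. This is an illustration rather than the authors' analysis code, and the log-linear correction (adding 0.5 to each cell) is our assumption; some such correction is needed because 10-item tests routinely produce hit rates of 1.0, for which the uncorrected z-transform is undefined.

```python
from scipy.stats import norm

def dprime(hits, misses, false_alarms, correct_rejections):
    """Equal-variance signal detection d' from raw counts."""
    # Log-linear correction: add 0.5 to each cell so perfect 5/5
    # performance on a 10-item test yields a finite z-score.
    hr = (hits + 0.5) / (hits + misses + 1.0)
    far = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return norm.ppf(hr) - norm.ppf(far)

# One hypothetical participant, two 10-trial tests (5 targets, 5 foils each)
cycles = [(5, 0, 1, 4),  # strong cycle: 5 hits, 0 misses, 1 FA, 4 CRs
          (4, 1, 2, 3)]  # weak cycle:   4 hits, 1 miss,   2 FAs, 3 CRs
exclude = any(dprime(*c) < 0.5 for c in cycles)  # the d' < 0.5 rule
```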
Design and materials

In both experiments, participants completed two study–test cycles containing 10 items each. For each participant, one of these study–test cycles was strong and one was weak, with the order randomly determined across participants. Participants in Experiment 1 also completed an additional study–test cycle that followed the two short blocks examined here. These data were collected for a different research project, one focused on theoretical questions outside the scope of the present work, and will therefore not be discussed further. Participants in Experiment 2 only completed the two short study–test cycles. The test phase was single-item recognition, where participants were asked to make old/new decisions on five targets and five foils. The order of targets and foils at test was randomized. Thus, the data analyzed here come from a 2 (block strength: strong vs. weak) × 2 (item type: target vs. foil) repeated-measures design.

Word stimuli were pulled from a pool of 800 high normative frequency words between four and 11 letters in length (median = 5) and ranging between 12.99 and 9 log frequency (M = 10.46) in the Hyperspace Analog to Language Corpus (Balota et al., 2007). For each participant, a subset of 30 words was selected and randomly assigned to condition. Stimulus presentation and recording of responses were conducted with the Psychtoolbox add-on for MATLAB (Brainard, 1997; Kleiner et al., 2007).

Procedure

Upon arrival to the experiment, all participants were given an informed consent form. Next, participants read instructions informing them that they would study a list of words and later have their memory for those words tested. At study, participants were given either a weak or strong encoding prompt for each trial, depending on the strength of that particular cycle. The weak encoding task for Experiment 1 asked participants to indicate whether or not the word contained the letter e. For strong encoding trials, participants indicated whether or not they considered the word to be pleasant (Fig. 1). In Experiment 2, the weak encoding task asked participants whether or not the stimulus was written in red, whereas the strong encoding task asked participants whether the stimulus was easy to imagine. For all prompts, yes and no responses were indicated by a single key press ("z" or "?" key, counterbalanced between participants). The study phase was self-paced with the lone constraint that a response could not be entered until a minimum of 1.5 seconds after stimulus presentation.

After each study phase, there was a math distractor task. In Experiment 1, this task lasted for 45 seconds, whereas in Experiment 2 it lasted for 90 seconds. After completing the distractor task, participants were given instructions for the test phase. At test, participants were presented with a single test word and asked to indicate whether that word was "old" (a previously studied word) or "new" (a word that was not studied). Participants indicated their choice by clicking on "old" or "new" response boxes located in the upper left and right corners of the screen (left/right order was counterbalanced between participants). After providing a recognition judgment, the stimulus disappeared, and participants clicked on a start button at the bottom of the screen to begin the next trial. Following the first recognition test, participants were allowed to take a short break (if necessary) prior to beginning the second study–test cycle. After completing all tests, participants were asked if they had any questions and were thanked for their participation.

[Fig. 1. Pure-strength study–test cycle in Experiment 1. Participants completed either 10 items using a weak encoding task (two leftmost study panels) or a strong encoding task (two rightmost study panels), followed by a 45-second distractor task and 10 single-item test trials. A second study–test cycle followed the first and used whichever encoding task participants did not see during Block 1. Experiment 2 used the same design with different encoding tasks and a longer distractor. In the weak condition, participants were asked, "Is this word written in red?" In the strong condition, participants were asked, "Is this word easy to imagine?"]

Results

All analyses were conducted using JASP (JASP Team, 2018). In addition to reporting standard frequentist statistics, we also report Bayes factors (BF). Bayes factors provide a continuous estimate of relative evidence. In particular, as presented here, BF1 indicates the ratio of evidence for the model with an effect compared with the null model. Values greater than 1 indicate evidence for a model with an effect, and values below 1 indicate support for the null model.
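In symbols, and as a general reminder rather than a description of JASP's implementation details, a Bayes factor of this form is the ratio of the marginal likelihoods of the data under the two models:

```latex
\[
\mathrm{BF}_{1} = \frac{p(\mathrm{data} \mid \mathcal{H}_{1})}{p(\mathrm{data} \mid \mathcal{H}_{0})}
\]
```

For example, BF1 = 23 means the data are 23 times more likely under the effect model than under the null, whereas BF1 = 0.31 means the data are roughly three times more likely under the null.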
In order to explore whether a strength-based mirror effect was obtained under conditions where criterion shifts would not be expected, we first conducted a 2 (strength: strong vs. weak) × 2 (trial type: target vs. foil) repeated-measures ANOVA. In Experiment 1, the data (see Table 1) demonstrated a Strength × Trial Type interaction, F(1, 68) = 14.80, p < .001, ηp² = .18, BF1 = 72.26.[2] This interaction was due to an increase in HR from weak to strong conditions accompanied by a decrease in FAR. Planned comparisons confirmed an increase in HR from weak (M = .88, SE = .02) to strong (M = .97, SE = .01) conditions, t(68) = 3.41, p = .001, d = .41, BF1 = 23.17. There was a numerical decrease in FAR between weak (M = .07, SE = .01) and strong (M = .05, SE = .01) conditions, but it was not statistically significant, t(68) = 1.35, p = .182, d = .16, BF1 = .31.

[2] Here, the BF represents support for an interaction model (Strength × Trial Type) relative to a model only containing main effects.

In Experiment 2, the data again demonstrated a Strength × Trial Type interaction, F(1, 35) = 59.43, p < .001, ηp² = .62, BF1 = 1.86e+9. Strong lists elicited a higher HR and a lower FAR than weak lists. Planned comparisons again confirmed an increase in HR from weak (M = .72, SE = .03) to strong (M = .98, SE = .01) lists, t(36) = 7.51, p < .001, d = 1.24, BF1 = 1.78e+6. Unlike Experiment 1, FARs were also reliably lower for strong lists (M = .03, SE = .01) relative to weak lists (M = .08, SE = .02), t(36) = 2.05, p = .048, d = 0.34, BF1 = 1.14.
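The ANOVAs above were run in JASP, but the key interaction has a simple equivalent form that is easy to verify by hand: in a 2 × 2 fully repeated-measures design, the interaction F with 1 degree of freedom equals the squared one-sample t of each participant's difference-of-differences. The sketch below uses made-up rates purely for illustration:

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical per-participant rates; columns are
# strong HR, weak HR, strong FAR, weak FAR.
rates = np.array([
    [1.0, 0.8, 0.0, 0.2],
    [0.8, 0.8, 0.2, 0.0],
    [1.0, 0.6, 0.2, 0.4],
    [0.9, 0.7, 0.1, 0.1],
])
strong_hr, weak_hr, strong_far, weak_far = rates.T

# SBME pattern: strengthening raises HR and lowers FAR, so the strength
# effect on hits minus the strength effect on false alarms should be
# reliably positive across participants.
contrast = (strong_hr - weak_hr) - (strong_far - weak_far)
t, p = ttest_1samp(contrast, 0.0)  # the interaction F(1, n-1) equals t**2
```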
Generally speaking, this experimental design is a challenge because limiting memory to such a short list necessarily results in near-ceiling performance, making it difficult to detect changes in FAR. Therefore, we performed an exploratory analysis that collapsed across these two highly similar studies to see whether the associated increase in power would provide clarity, especially with regard to the FAR. As expected, the 2 (experiment) × 2 (strength) × 2 (trial type) mixed-factors ANOVA revealed a three-way interaction, F(1, 104) = 17.50, p < .001, ηp² = .14, BF1 = 229.89. As is apparent from Table 1, this interaction is the product of the typical SBME interaction being more pronounced in Experiment 2 than in Experiment 1. However, the direction of the interaction is identical. This combined data set demonstrates a reliable difference between strong (M = .04, SE = .01) and weak (M = .07, SE = .01) FAR, t(105) = 2.26, p = .026, d = 0.22, BF1 = 1.22, and between strong (M = .97, SE = .01) and weak (M = .83, SE = .02) HR, t(105) = 6.75, p < .001, d = 0.66, BF1 = 1.18e+7.

Table 1. Hit and false-alarm rates in Experiments 1 and 2

              Hit rate              False-alarm rate
              Weak       Strong     Weak       Strong
Experiment 1  .88 (.02)  .97 (.01)  .07 (.01)  .05 (.01)
Experiment 2  .72 (.03)  .98 (.01)  .08 (.02)  .03 (.01)

Note. Standard error in parentheses.

Discussion

We observed the descriptive SBME pattern of higher HR and lower FAR for strongly encoded lists, indicating that the SBME can be elicited even under conditions where criterion shifts are highly unlikely. However, the small magnitude of the BFs suggests that the evidence for differences in the FAR was not strong. We see these experiments as a "proof of concept" and will return to a more rigorous (and preregistered) evaluation of this short-list SBME in Experiment 5.

Although a substantial body of literature suggests it takes more than 10 trials to establish firm expectations about memory quality at test, it could be possible that participants were able to establish expectations about memory quality in the short study lists. In other words, it is possible that participants quickly established an accurate expectation about the strength of the upcoming test list (but see Turner et al., 2011, for evidence that participants bring preexisting memory expectations into the experimental context). This would require that participants can somewhat accurately evaluate memory fidelity for individual study items. In Experiments 3 and 4, we provide an empirical test of how these expectations develop over the course of study and test, using list lengths that are more typical of SBME experiments.

Experiments 3 and 4

To best address how test expectations develop, we look to studies assessing metaknowledge. Benjamin (2003) examined individuals' expectations regarding the recognizability of low-frequency and high-frequency words. During study, Benjamin asked participants to rate the likelihood that they would recognize a studied item on the subsequent memory test. Three separate studies indicated that, in general, participants incorrectly expected that they would have better memory for high-frequency words than for low-frequency words. The inability of participants to grasp—prior to test—the effects of word frequency on recognition led Benjamin (2003) to speculate that people may often have "poor self-assessment of one's own memory ability and, by extension, of the effects of different variables on one's memory" (p. 304).
Obviously, if such a claim were true in the standard SBME paradigm, this would question the viability of criterion shifts to produce all SBMEs. It seems reasonable to assume that people are aware that repetition helps memory. However, an SBME is observed not just when items are strengthened by repetition but also through levels of processing, as in Experiments 1 and 2 (see also Glanzer & Adams, 1990; Kiliç, Criss, Malmberg, & Shiffrin, 2017; Koop & Criss, 2016). We collect memory predictions in mixed-strength study lists (Experiment 3) and pure-strength study lists (Experiment 4) to establish whether participants are aware that different encoding conditions lead to different levels of subsequent memory.

Method

Participants

Introductory psychology students participated in Experiments 3 and 4. Thirty-one students from Eastern Mennonite University participated in Experiment 3 in exchange for partial course credit. Forty-four students from Syracuse University participated in Experiment 4 and were compensated with partial course credit. As in the first two experiments, all participants that did not achieve a d′ above 0.5 were excluded from analyses. This resulted in excluding one participant from Experiment 3 and one participant from Experiment 4.

Design and materials

In Experiments 3 and 4, participants completed two study–test cycles. Study lists consisted of 30 words, and test lists consisted of 60 words (30 targets and 30 foils). In Experiment 3, we presented mixed-strength study lists. In each study list, half of the words were presented with the weak encoding task ("Does this word contain the letter e?") and half of the words were presented with the strong encoding task ("Is this word pleasant?"). Strong and weak items were randomly intermixed at study and test. Study lists in Experiment 4 were pure strength, and the encoding tasks were identical to Experiment 3. Each participant completed one weak study–test cycle and one strong study–test cycle. The order in which participants encountered strong and weak blocks was randomly assigned across subjects.

Stimuli were pulled from a pool of 424 medium normative frequency words between three and 13 letters in length (median = 6) and ranging between 13.22 and 5.19 log frequency (M = 8.87) in the Hyperspace Analog to Language Corpus (Balota et al., 2007). For each participant, a subset of 180 words was randomly selected from this pool and randomly assigned to strength condition (weak vs. strong). Stimuli were presented and responses were recorded using the Psychtoolbox add-on for MATLAB (Brainard, 1997; Kleiner et al., 2007).

Procedure

The procedure for Experiments 3 and 4 is depicted in Fig. 2. First, participants were instructed that they would be asked to study lists of words and later complete a test of their memory for those words. Participants completed two study–test cycles. The critical addition to Experiments 3 and 4 was that we also collected participants' predictions about their ability to later recognize each studied word (1 = I won't recognize, 9 = I will recognize; Benjamin, 2003). These predictions were collected on each study trial immediately after participants responded to the encoding task. All other details for the study phase and subsequent distractor task were identical to Experiment 1. The test phase in Experiment 3 was procedurally identical to that of the previous experiments, with the exception of length and that study was mixed strength. Experiment 4 had one additional change.
Because participants experienced pure-strength lists in Experiment 4, it was possible to collect weak and strong postdictions at test (Benjamin, 2003). Whenever participants in Experiment 4 provided a "new" response at test, they were asked to respond to the question "How likely would you have been to remember this word if you had actually studied it?" by providing a rating on a 1–9 scale (1 = I am sure I would NOT recognize this word; 9 = I am sure I WOULD recognize this word). Each test word remained onscreen during the postdiction phase.

Results

We first examined the participants' accuracy data (see Table 2) to verify that the encoding manipulation had the expected effect. In Experiment 3, participants showed a higher HR to strongly encoded items (M = .94, SE = .01) than to weakly encoded items (M = .82, SE = .02), t(29) = 6.36, p < .001, d = 1.16, BF1 = 28795.70. The mixed-strength design of Experiment 3 means that it is not possible to compare weak and strong FAR. For Experiment 4, we conducted a 2 (strength: weak vs. strong) × 2 (trial type: target vs. foil) repeated-measures ANOVA. As expected, there was the Strength × Trial Type interaction that is characteristic of an SBME, F(1, 42) = 47.92, p < .001, ηp² = .53, BF1 = 1.69e+5. Strong HRs (M = .94, SE = .01) were higher than weak HRs (M = .85, SE = .02), t(42) = 6.10, p < .001, d = .93, BF1 = 54362.28, whereas strong FARs (M = .07, SE = .01) were lower than weak FARs (M = .14, SE = .02), t(42) = 3.73, p = .001, d = .57, BF1 = 49.15.

Having confirmed that the encoding manipulations had the intended effect, we turn attention to the question of whether participants were aware of how encoding would affect later memory (see Fig. 3). In other words, did participants expect to have better memory for items presented with the strong versus weak encoding task? Experiment 3 had a single study list with strong and weak encoding tasks intermixed at study. We evaluated participants' predicted recognizability for all strongly studied items and all weakly studied items. Ratings of strongly studied items (M = 6.84, SE = 0.26) did not reliably differ from ratings of weakly studied items (M = 6.79, SE = 0.29), t(29) = 0.61, p = .550, d = 0.11, BF1 = 0.23. In Experiment 4, encoding task was manipulated between lists. We again compared predicted recognizability for strong and weak items. Participants in Experiment 4 showed slightly higher predictions for strong items (M = 6.34, SE = 0.22) than for weak items (M = 6.04, SE = 0.25), t(42) = 2.06, p = .046, d = 0.31, BF1 = 1.12.

We also collected postdictions in Experiment 4. Each time participants provided a "new" response at test, they were asked how likely they would have been to remember that word if it had actually been presented. Participants showed greater postdicted confidence for strong correct rejections (M = 5.95, SE = 0.27) than for weak correct rejections (M = 5.64, SE = 0.27), t(42) = 2.84, p = .007, d = 0.43, BF1 = 5.41. However, there was not a reliable difference between strong misses (M = 4.97, SE = 0.49) and weak misses (M = 5.00, SE = 0.32), t(22) = 0.20, p = .84, d = 0.04, BF1 = 0.22. Notably, the average confidence ratings for misses are lower than those of correct rejections. This means that participants indicated that they would be more likely to remember unstudied foils than items they actually studied.
One distinct possibility is that participants simply reported confidence in their response rather than postdiction judgments (or they conflated the two). A reviewer suggested an interesting alternative interpretation. He suggested that lower ratings for misses than correct rejections indicate that metamemory judgments are quite accurate, in the sense that participants accurately indicated that they would not have remembered exactly those items that they forgot (i.e., misses). In either event, these results are not informative for the present research question without additional research.

[Fig. 3. Mean predicted recognizability for studied items in Experiments 3 (mixed-strength lists) and 4 (pure-strength lists). Error bars are ±1 SE.]

Table 2. Hit and false-alarm rates in Experiments 3 and 4

                                Hit rate              False-alarm rate
Experiment                      Weak       Strong     Weak       Strong
Experiment 3 (mixed-strength)   .82 (.02)  .94 (.01)       .08 (.02)
Experiment 4 (pure-strength)    .85 (.02)  .94 (.01)  .14 (.02)  .07 (.01)

Note. Strong and weak study trials are intermixed in Experiment 3; therefore, it is not possible to define weak or strong false-alarm rates, and a single rate is reported. Standard error in parentheses.

[Fig. 2. Procedure for collecting predictions (top panel; Experiments 3 and 4) and postdictions (bottom panel; Experiment 4 only). Predictions were collected following every trial during the study phase. Postdictions were collected only following a "NEW" response. In the bottom panel, no postdiction is collected for "jazz," because the participant identified it as an old word. A postdiction is collected for "spoke" because it was identified as a new word.]

Discussion

The goal of Experiments 3 and 4 was to assess participants' awareness of the effects of different encoding manipulations. We investigated whether participants have accurate expectations about the effects of encoding manipulations on memory quality, as predicted by the decisional account. Remarkably, the data suggest no difference in predicted recognizability between weak encoding tasks and strong encoding tasks in Experiment 3, even though participants' accuracy was dramatically higher for strong items. It appears as though participants do not notice clear distinctions in memory quality at a trial-by-trial level. Experiment 4 did show a difference between predicted recognizability for strong and weak items. However, the small magnitude of the BF suggests minimal evidence for the observed difference in predicted memory. We will return to this with a large-N, preregistered evaluation of memory expectations in Experiment 5.

Interestingly, we note that hit rates in Experiments 3 and 4 are extremely similar. However, the ratings of predicted memorability are higher for Experiment 3 than for Experiment 4. In other words, quite different memorability ratings do not correspond to differences in accuracy. This is another indication that participants are not well calibrated with respect to assessing future memory or how encoding might affect later memory.

Our measure of predicted memory is somewhat similar to judgments of learning (JOLs). Current theorizing attributes JOLs to fluency in processing the individual items and to beliefs about what affects memory (e.g., Dunlosky, Mueller, & Tauber, 2015; Koriat, 1997).
The role of beliefs has largely been tested in terms of properties of the items (e.g., font size), whereas here we manipulated the encoding task that is common to all items. Consistent with prior work, memory predictions do not reflect differences in quality between encoding tasks like those used here. For example, Begg, Duft, Lalonde, Melnick, and Sanvito (1989) provided individuals with interactive or separate imagery-based encoding tasks. Although memory performance differed between groups, memory predictions did not.

Recall that strength-based criterion shifts require two things: sufficient trials for adjustment and accurate expectations about memory quality on the part of participants. Experiments 1 and 2 showed preliminary evidence for an SBME with insufficient trials to establish a criterion. Experiments 3 and 4 showed that participants do not have accurate expectations about memory quality. In our final study, we collect additional data about participants' memory expectations by asking a single question about predicted memory for the entire set of targets. After all, it is possible that a post-study estimate might better characterize expectations about the encoding conditions, absent judgments about the specific target items.

Experiment 5

In Experiment 5, we combine the short-list design of Experiments 1 and 2 while also assessing participants' awareness of the specific encoding manipulations used therein. If participants continued to show an SBME while simultaneously failing to note the mnemonic consequences of weak and strong encoding tasks, then a pure criterion shift account of all SBMEs would be highly unlikely (given the standard assumption that criterion shifts are active control processes). On the other hand, if participants accurately assess differences in memory strength even after short study lists, this could provide grounds for revisiting assumptions about the speed with which criterion shifts can occur. Experiment 5 was a preregistered study that was identical to two additional experiments (5a and 5b) that appeared in a previous draft of this manuscript. Results from those studies can be found in the supplementary materials posted at https://osf.io/bv6c3/. The preregistration for Experiment 5 can also be found there.

Method

Participants

In order to determine our sample size for Experiment 5, we performed a power analysis using G*Power (Faul, Erdfelder, Lang, & Buchner, 2007). We selected a sample size of 120 participants because it would give us power above 1 − β = .9 with an effect size of d = .3 (roughly the effect size on FAR from Experiment 2). To ensure that we would accrue 120 participants after no-shows, cancellations, and exclusions, we posted many more than 120 sessions at Syracuse University and Eastern Mennonite University. In total, 176 individuals participated in the experiment.[3] All participants completed the experiment before we looked at any data. Participants were compensated with partial fulfillment of course requirements.

[3] G.K. and A.H.C. had an interesting conversation about whether to report the first 120 participants so as to remain faithful to the preregistered sample size or to report all participants. In the end, we agreed that it was not sensible to discard the contribution of a large number of volunteers because we pessimistically scheduled too many appointments. As Alexander DeHaven wrote, "preregistration is a plan not a prison" (https://cos.io/blog/preregistration-plan-not-prison/). Finally, note that the pattern of results does not change with the smaller sample size.
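The G*Power calculation above is straightforward to reproduce elsewhere. The sketch below uses statsmodels and assumes a two-sided, one-sample (equivalently, paired) t-test at alpha = .05; the alpha level and sidedness are our assumptions, since the article reports only the effect size and target power:

```python
from statsmodels.stats.power import TTestPower

# Solve for the sample size of a one-sample / paired t-test with
# effect size d = .3 and power = .9 (alpha and sidedness assumed).
n = TTestPower().solve_power(effect_size=0.3, power=0.9, alpha=0.05,
                             alternative='two-sided')
print(round(n))  # ~119, in line with the preregistered target of 120
```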
Design and materials

Experiment 5 is a complete replication of Experiment 2, with the addition of a single question about the quality of participants' memory immediately prior to the test phase (described more fully below).

Procedure

Participants received the same instructions and completed the same study–test cycles as in Experiment 2. Following the 90-second distractor task and test instructions, participants were asked an additional question about their perceived memory quality. The question was as follows:

    The test will be 10 words in length. Before starting the test, we would like you to estimate how well you will do. If you believe you will give the correct "OLD" or "NEW" response to all 10 items, you would type "10." If you feel like you will be completely guessing, you can expect to get around five answers correct, and should enter "5."

Following this question, participants then completed the 10 single-item recognition test trials just as described in Experiment 2. After completing both a strong study–test cycle and a weak study–test cycle (counterbalanced across participants), participants were thanked and then dismissed.

Results

Participants were excluded using the same criterion (d′ < .5 in each study–test cycle) as in the previous studies. This resulted in excluding 22 participants. Additionally, one individual gave a memory prediction response outside of the 0–10 scale and was therefore excluded from analysis. In total, data from 153 participants were analyzed.

We analyzed accuracy data using a 2 (strength: strong vs. weak) × 2 (trial type: target vs. foil) repeated-measures ANOVA and paired-samples t tests. Predicted recognition performance following weak and strong study lists was compared using a paired-samples t test. These analyses were preregistered. In addition, we included Bayes factors for all analyses, which we did not preregister.

Participants showed a Strength × Trial Type interaction, F(1, 152) = 193.98, p < .001, ηp² = .56, BF1 = 8.37e+30. HRs were higher on strong blocks (M = .97, SE = .01) than on weak blocks (M = .75, SE = .02), t(152) = 13.48, p < .001, d = 1.09, BF1 = 3.64e+24. There was also a reliable difference in FAR between strong (M = .04, SE = .01) and weak blocks (M = .10, SE = .01), t(152) = 4.26, p < .001, d = 0.34, BF1 = 412.29. Thus, the SBME was present for short study and test blocks.

An analysis of participants' predicted memory showed a small difference between predictions following a strong block (M = 7.29, SE = 0.16) and those following a weak block (M = 6.94, SE = 0.16). The statistical evidence for this effect is mixed, t(152) = 2.05, p = .042, d = 0.17, BF1 = 0.70, with the p value indicating support for this difference and the BF indicating no evidence for an effect and, in fact, weak evidence for a null effect.

Discussion

Experiment 5 was designed to assess whether individuals show an SBME under conditions where a criterion shift is unlikely and without demonstrating awareness of differing memory quality between the two encoding tasks.
The results from Experiment 5 clearly showed strong evidence for an SBME under conditions where criterion shifts would not be expected. Whether participants are aware that different encoding tasks result in differences in memory quality is ambiguous. Given the strong evidence of an SBME, it is particularly striking that there was trivial evidence for differences in predicted memory. If participants are basing a criterion shift on the outcome of encoding, then the cognitive system must be magnifying small differences in expected memory into rather large differences in the decision space.

General discussion

The experiments presented here have demonstrated that it is possible to elicit an SBME under conditions inhospitable to criterion shifts. We observed an SBME even when participants had few items on which to establish a criterion. We used encoding manipulations that participants did not consistently think affected memory. We demonstrated these findings in the first four experiments and then combined them in a preregistered study with a large sample size. Collectively, this demonstrates the presence of an SBME under conditions where criterion shifts would not be predicted. Further, participants do not accurately adjust their expectations about memory quality in response to levels-of-processing manipulations. Even if participants could set a specific criterion for the list after exposure to only a few items, they would set an inappropriate criterion (to generate an SBME) because they estimate that encoding tasks produce minimal differences in memory accuracy.

Prior research has established that an SBME can be found when differentiation is not present. This, of course, is consistent with all models of memory because no models dispute the possibility of a criterion or the ability of participants to modify a criterion to suit the needs of the individual or context. Here, we find that an SBME is observed under conditions that would not seem to support a criterion shift of the sort required to produce an SBME. This does not imply that differentiation is responsible for the pattern of HR and FAR; there could be some alternative mechanism that produces an SBME. However, our research does suggest that a criterion shift is unlikely to be the single mechanism responsible for this pattern of data.

Implications for strength-based mirror effects

Because our aim in the present work was to address the debate between decisional and mnemonic accounts of SBMEs, we have ignored a more nuanced perspective on criterion shifts. Recent work has raised the possibility that individuals can adjust decision criteria on an item-by-item basis, but that making the decision as to what strength to expect is arduous (Starns & Olchowski, 2015). Thus, rather than a failure to elicit criterion shifts, much of the literature reviewed in the introduction could potentially be framed as failures to effectively manipulate strength expectations. Starns and Olchowski (2015; see also Franks & Hicks, 2016) produced item-by-item shifts in FARs by making encoding strength covary with the side of the screen on which items were presented and requiring participants to use different response keys for strong and weak items. For example, when an item was presented on the right side of the screen, it was either a strong target or a foil, whereas items presented on the left side of the screen were either weak targets or foils.
Providing an "old" response then required the use of different keys for items presented on the left or right side of the screen.

While this work may very well demonstrate that participants often do not shift their strength expectations (rather than criteria) in typical mixed-strength recognition studies, it does not affect the interpretation of our results. First, the fundamental assumption is that without external affordances, participants take time to adapt to a new decision environment. Based on this work, one presumes that the literature demonstrating that criterion shifts take time (e.g., Brown & Steyvers, 2005; Hicks & Starns, 2014; Verde & Rotello, 2007) could easily be reframed as slow decisions to adopt different strength expectations. However, the fact remains that differences in FAR are only observable on an item-by-item basis when significant affordances are provided. The affordances may include things like color cuing (Hicks & Starns, 2014; Starns & Olchowski, 2015; Stretch & Wixted, 1998), pure-strength blocking at test (Hicks & Starns, 2014; Starns et al., 2012; Verde & Rotello, 2007), or forcing individuals to acknowledge strength distributions through different response keys (Starns & Olchowski, 2015). In short, without explicit cues, it takes time for people to adapt to changing strength environments. Thus, our results challenge the notion that any adjustment of memory strength expectations occurred during the shortened pure-strength SBME design.

Again, one might ask the question: If it is so hard to elicit this adjustment (whether a criterion shift or a decision about expected strength), why is it that pure-strength lists reliably produce an SBME without going to any great lengths to alert (or force) participants to acknowledge differences in strength between lists? One possibility is that the encoding tasks provide this explicit affordance. After all, if items are strengthened by repetition, participants "should certainly expect to have better memory after an entire list of words studied five times than after an entire list of words studied once" (Starns & Olchowski, 2015, p. 57). However, all the experiments presented above used less intuitive strength manipulations than mere repetition. Experiments 3, 4, and 5 directly tested this assumption. Those data suggested that participants do not consistently form dramatically different expectations for strong encoding tasks and weak encoding tasks. One possible explanation for why repetition leads to clear memory expectations, whereas our strength manipulations do not, is that memory expectations are influenced by the ease with which study items are processed (Begg et al., 1989). Repetition manipulations make items easier to process and therefore have a much larger impact on memorability expectations than do the manipulations used in this work. For example, both strong and weak tasks are relatively easy for participants to perform and therefore do not have significant effects on expected memorability. This account is very speculative, and future work could examine this possibility in greater detail.

In summary, although the results shown by Starns and Olchowski (2015) certainly demonstrated that criterion shifts can produce a mirror effect, it is premature to assume that such item-by-item criterion shifts underlie all demonstrations of SBMEs.

Typically, data demonstrating SBMEs have fallen into two competing (and often mutually exclusive) camps: the decisional account and the mnemonic account.
This debate has generated a significant amount of data. Some of these data have produced an SBME under conditions not predicted by the mnemonic account, whereas the present data produced an SBME under conditions not predicted by the decisional account. After surveying this literature, we believe it will be most fruitful to take a "both/and" approach to the SBME rather than "either/or." Taken as a whole, the literature indicates that neither a pure-decisional nor a pure-mnemonic account explains all SBMEs. We also suggest that these data should lead to changes in terminology. Rather than speaking of the SBME as if this is a single phenomenon, these results suggest it is more appropriate to discuss an SBME.

Implications for memory theory

While SBMEs have produced a significant amount of research, the effect is only interesting insofar as it tells us something about the nature of the memory processes used to produce it. In other words, "What drives the SBME?" is not a particularly important question, whereas "What do SBMEs tell us about the nature of memory?" is. Focusing too narrowly on SBMEs may lead to an increasingly compartmentalized memory literature—a broader problem that has led to a fair amount of handwringing (Criss & Howard, 2015; Hintzman, 2011; Malmberg, 2008). In short, theorists have raised concerns about the generalizability of memory models. At times it appears as though accounts are created ad hoc for individual tasks even though similar mechanisms should underlie performance of the memory system in a number of domains. We briefly highlight this debate because one argument occasionally used to support a pure-decisional account is that it is parsimonious or commonsensical. Although a pure-decisional account may seem to be a relatively straightforward explanation of SBMEs (though differentiation is not a particularly complex explanation either), we see it as contributing to the fractured landscape of memory models.

The critical question, then, is whether these data advance our broad understanding of memory, or whether we are simply "asking what causes some characteristic twitch in the data" (Hintzman, 2011, p. 267). Our hope is that rather than merely cataloguing memory effects and developing ad hoc models to explain them, we can identify models that do reasonably well at accommodating data from a wide variety of tasks. In the present context, this means identifying a model that can accommodate the "both/and" approach rather than only a "pure decisional" or "pure mnemonic" approach. In other words, a model should include a mechanism for decisional processes (i.e., an adjustable criterion) as well as incorporating mnemonic processes like differentiation, exactly as suggested by Atkinson and Shiffrin (1968).

We briefly highlight the retrieving effectively from memory (REM; Shiffrin & Steyvers, 1997) framework for its ability to use common mechanisms to produce human data across a variety of tasks (Malmberg, 2008). Unlike signal detection theory, REM is a true process model that provides an account of how memories are encoded, stored, and retrieved. For example, versions of REM have been applied to recognition (Shiffrin & Steyvers, 1997), free recall (Lehman & Malmberg, 2013), and cued recall (Diller, Nobel, & Shiffrin, 2001; Wilson & Criss, 2017).
Although this work does not necessarily fulfill Hintzman's (2011) call to study memory "in the wild," we believe the success of REM across a number of different experimental tasks begins to speak to more general characteristics of memory, like differentiation. Concerning SBMEs, REM incorporates both decisional processes and differentiation, and could therefore conceivably cover the breadth of data collected on the SBME to date.

Finally, we see the present work as addressing a problem also noted by Atkinson and Shiffrin (1968): multiple control processes may give rise to similar patterns of memory performance. For example, many SBME studies used fairly transparent strengthening operations like repetition. This type of strengthening task most likely elicits metacognitive criterion shifts that are absent for encoding processes like those used in the present experiments. It has taken a sizable amount of research to establish that some SBMEs may be the product of repetition strategies and subsequent criterion shifts, whereas others (like depth of processing) may somewhat automatically elicit an SBME via differentiation. While the present work indicates that a depth-of-processing manipulation produces SBMEs without a criterion shift, it remains unclear exactly what people are doing during such encoding tasks, and future work will need to model these encoding processes more precisely.

Acknowledgements Data and supplementary materials are available at https://osf.io/bv6c3/. Experiment 5 was preregistered at https://aspredicted.org/b9ne5.pdf. We thank the following students for assistance collecting data: Michael Austin, Lara Weaver, Andrew Peltier, Sophi Hartman, and Olivia Dalke.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Atkinson, R. C., & Shiffrin, R. M. (1968). Human memory: A proposed system and its control processes. In K. W. Spence & J. T. Spence (Eds.), The psychology of learning and motivation (Vol. 2, pp. 89–195). New York, NY: Academic Press.
Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, B., Loftis, B., … Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39, 445–459.
Begg, I., Duft, S., Lalonde, P., Melnick, R., & Sanvito, J. (1989). Memory predictions are based on ease of processing. Journal of Memory and Language, 28, 610–632.
Benjamin, A. S. (2003). Predicting and postdicting the effects of word frequency on memory. Memory & Cognition, 31, 297–305.
Benjamin, A. S., & Bawa, S. (2004). Distractor plausibility and criterion placement in recognition. Journal of Memory and Language, 51, 159–172.
Brainard, D. H. (1997). The Psychophysics Toolbox. Spatial Vision, 10, 433–436.
Brown, S., & Steyvers, M. (2005). The dynamics of experimentally induced criterion shifts. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 587–599.
Brown, S., Steyvers, M., & Hemmer, P. (2007). Modeling experimentally induced strategy shifts. Psychological Science, 18, 40–45.
Bruno, D., Higham, P. A., & Perfect, T. J. (2009). Global subjective memorability and the strength-based mirror effect in recognition memory. Memory & Cognition, 37, 807–818.
Cox, J. C., & Dobbins, I. G. (2011). The striking similarities between standard, distractor-free, and target-free recognition. Memory & Cognition, 39, 925–940.
Criss, A. H. (2006). The consequences of differentiation in episodic memory: Similarity and the strength based mirror effect. Journal of Memory and Language, 55, 461–478.
Criss, A. H. (2009). The distribution of subjective memory strength: List strength and response bias. Cognitive Psychology, 59, 297–319.
Criss, A. H. (2010). Differentiation and response bias in episodic memory: Evidence from reaction time distributions. Journal of Experimental Psychology: Learning, Memory, and Cognition, 36, 484.
Criss, A. H., & Howard, M. (2015). Models of episodic memory. In J. Busemeyer, J. T. Townsend, Z. Wang, & A. Eidels (Eds.), Oxford handbook of computational and mathematical psychology. New York, NY: Oxford University Press.
Criss, A. H., & Koop, G. J. (2015). Differentiation in episodic memory. In J. Raaijmakers, A. H. Criss, R. Goldstone, R. Nosofsky, & M. Steyvers (Eds.), Cognitive modeling in perception and memory: A Festschrift for Richard M. Shiffrin (pp. 112–125). New York, NY: Psychology Press.
Diller, D. E., Nobel, P. A., & Shiffrin, R. M. (2001). An ARC–REM model for accuracy and response time in recognition and recall. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27(2), 414–435.
Dunlosky, J., Mueller, M. L., & Tauber, S. K. (2015). The contribution of processing fluency (and beliefs) to people's judgments of learning. In D. S. Lindsay, A. P. Yonelinas, & H. L. Roediger (Eds.), Remembering: Attributions, processes, and control in human memory: Essays in honor of Larry Jacoby (pp. 46–64). New York, NY: Psychology Press.
Estes, W. K., & Maddox, W. T. (1995). Interactions of stimulus attributes, base rates, and feedback in recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 1075–1095.
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191. https://doi.org/10.3758/BF03193146
Franks, B. A., & Hicks, J. L. (2016). The reliability of criterion shifting in recognition memory is task dependent. Memory & Cognition, 44, 1215–1227.
Glanzer, M., & Adams, J. K. (1990). The mirror effect in recognition memory: Data and theory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 5–16.
Glanzer, M., Adams, J. K., Iverson, G. J., & Kim, K. (1993). The regularities of recognition memory. Psychological Review, 100(3), 546–567.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York, NY: Wiley.
Hicks, J. L., & Starns, J. J. (2014). Strength cues and blocking at test promote reliable within-list criterion shifts in recognition memory. Memory & Cognition, 42, 742–754.
Higham, P. A., Perfect, T. J., & Bruno, D. (2009). Investigating strength and frequency effects in recognition memory using Type-2 signal detection theory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 57–80.
Hintzman, D. L. (2011). Research strategy in the study of memory: Fads, fallacies, and the search for the "coordinates of truth". Perspectives on Psychological Science, 6(3), 253–271.
Hirshman, E. (1995). Decision processes in recognition memory: Criterion shifts and the list-strength paradigm. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 302–313.
JASP Team. (2018). JASP (Version 0.9) [Computer software]. Retrieved from https://jasp-stats.org/
Kiliç, A., Criss, A. H., Malmberg, K. J., & Shiffrin, R. M. (2017). Models that allow us to perceive the world more accurately also allow us to remember past events more accurately via differentiation. Cognitive Psychology, 92, 65–86.
Kleiner, M., Brainard, D. H., Pelli, D. G., Broussard, C., Wolfe, T., & Niehorster, D. (2007). What's new in Psychtoolbox-3? Perception, 36(14), 14.
Koop, G. J., & Criss, A. H. (2016). The response dynamics of recognition memory: Sensitivity and bias. Journal of Experimental Psychology: Learning, Memory, and Cognition, 42(5), 671–685.
Koop, G. J., Criss, A. H., & Malmberg, K. J. (2015). The role of mnemonic processes in pure-target and pure-foil recognition memory. Psychonomic Bulletin & Review, 22(2), 509–516.
Koriat, A. (1997). Monitoring one's own knowledge during study: A cue utilization approach to judgments of learning. Journal of Experimental Psychology: General, 126, 349–370.
Lehman, M., & Malmberg, K. J. (2013). A buffer model of memory encoding and temporal correlations in retrieval. Psychological Review, 120(1), 155–189.
Macmillan, N. A., & Creelman, C. D. (2004). Detection theory: A user's guide. New York, NY: Cambridge University Press.
Malmberg, K. J. (2008). Recognition memory: A review of the critical findings and an integrated theory for relating them. Cognitive Psychology, 57(4), 335–384.
McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A subjective-likelihood approach to the effects of experience in recognition memory. Psychological Review, 105(4), 724–760.
Morrell, H. E., Gaitan, S., & Wixted, J. T. (2002). On the nature of the decision axis in signal-detection-based models of recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28, 1095–1110.
Parks, T. E. (1966). Signal-detectability theory of recognition-memory performance. Psychological Review, 73, 190–204.
Rhodes, M. G., & Jacoby, L. L. (2007). On the dynamic nature of response criterion in recognition memory: Effects of base rate, awareness, and feedback. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33, 305–320.
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM—Retrieving effectively from memory. Psychonomic Bulletin & Review, 4, 145–166.
Starns, J. J., & Olchowski, J. E. (2015). Shifting the criterion is not the difficult part of trial-by-trial criterion shifts in recognition memory. Memory & Cognition, 43(1), 49–59.
Starns, J. J., Ratcliff, R., & White, C. N. (2012). Diffusion model drift rates can be influenced by decision processes: An analysis of the strength-based mirror effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38, 1137–1151.
Starns, J. J., White, C. N., & Ratcliff, R. (2010). A direct test of the differentiation mechanism: REM, BCDMEM, and the strength-based mirror effect in recognition memory. Journal of Memory and Language, 63, 18–34.
Stretch, V., & Wixted, J. T. (1998). On the difference between strength-based and frequency-based mirror effects in recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 1379–1396.
Turner, B. M., Van Zandt, T., & Brown, S. (2011). A dynamic stimulus-driven model of signal detection. Psychological Review, 118, 583–613.
Verde, M. F., & Rotello, C. M. (2007). Memory strength and the decision process in recognition memory. Memory & Cognition, 35, 254–262.
Wilson, J. H., & Criss, A. H. (2017). The list strength effect in cued recall. Journal of Memory and Language, 95, 78–88.
Wixted, J. T., & Gaitan, S. C. (2002). Cognitive theories as reinforcement history surrogates: The case of likelihood ratio models of human recognition memory. Animal Learning & Behavior, 30(4), 289–305.