Commentary
A Case for the Use of Nonparametric Statistical Methods in Library Research
Megan Hodge
Assistant Head for Teaching & Learning
VCU Libraries
Virginia Commonwealth University
Richmond, Virginia, United States of America
Email: mlhodge@gmail.com
Received: 28 Feb. 2019    Accepted: 1 May 2019

© 2019 Hodge. This is an Open Access article distributed under the terms of the Creative Commons-Attribution-Noncommercial-Share Alike License 4.0 International (http://creativecommons.org/licenses/by-nc-sa/4.0/), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is
properly attributed, not used for commercial purposes, and, if transformed, the
resulting work is redistributed under the same or similar license to this one.
DOI: 10.18438/eblip29563
If called upon to name statistical methods, the average
librarian would likely reply with examples such as correlation and t-test. Those with more experience
conducting research might name comparatively exotic tests such as MANCOVA or factor analysis. These are all examples of what are known as parametric statistical tests: tests designed to answer a limited set of questions about a data set in which most characteristics of the data are already known or assumed. Because parametric tests depend upon these assumptions, librarians who intend to use them must take care to collect data with these assumptions in mind.
While these assumptions vary somewhat from test to test, some
are common to most parametric tests. One is the assumption that the data at
least approximately resembles a normal distribution (also known as a
“bell-shaped curve”), with most scores falling around a central score, fewer
scores falling further from that central score, and few or no extreme scores.
Another is the assumption that data is measured on a continuous scale, with the
dependent (observed) variable measured on a real numerical scale, such as GRE
scores or number of program attendees. A third common assumption is that there
is a minimum number of participants in each group.
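As a brief illustration, one common way to check the normality assumption before committing to a parametric test is the Shapiro-Wilk test. The Python sketch below applies it to a small, hypothetical set of workshop attendance counts; the data and the 0.05 cutoff are assumptions for illustration only.

```python
# Minimal sketch: checking the normality assumption on a hypothetical
# sample of workshop attendance counts before choosing a test.
from scipy import stats

attendance = [3, 4, 4, 5, 5, 6, 7, 9, 12, 41]  # hypothetical data; 41 is an outlier

stat, p = stats.shapiro(attendance)  # Shapiro-Wilk test of normality
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")

# A small p-value (e.g., p < 0.05) suggests the sample departs from a
# normal distribution, which argues for a nonparametric alternative.
if p < 0.05:
    print("Normality assumption looks doubtful; consider a nonparametric test.")
```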
If even one of these assumptions is not true, then it is
likely that parametric tests should not be used. Library research often violates these assumptions: the data may be heavily skewed, with a tendency towards larger or smaller values; the variable of interest may be categorical, or non-numeric (e.g., emotions elicited by the library: anxiety, gratitude, wonder, frustration), or ordinal, with an order but no numerical value naturally associated with that order (e.g., degree of comfort using a database: not comfortable, neutral, comfortable); and sample sizes in library research are often small.
There are strategies that researchers can use to reduce the
likelihood of these issues: rewrite questions to use a continuous rather than
non-continuous scale; develop a plan to recruit a larger number of
participants; remove outliers from the data. Sometimes these strategies are not
feasible, however: the data may already have been collected and the sample size
cannot be increased; the variable of interest cannot be measured on a
continuous scale; the outliers are valid, if inconvenient, scores.
Fortunately, there is an alternative to parametric
statistical tests: nonparametric statistical tests. Nonparametric tests tend
not to rely upon the same assumptions required by parametric tests. Instead, they rely upon the median as the measure of a data set’s central tendency, rather than the mean used by parametric methods. The mean is influenced by outliers in the data set; the median is not.
Nonparametric alternatives exist for most common parametric methods, including
ANOVAs, Pearson product-moment correlations, and t-tests.
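The difference between the two measures is easy to see with a small, hypothetical example: one unusually busy day at a service desk pulls the mean well away from a typical day, while the median is unchanged.

```python
# Minimal sketch: how one outlier moves the mean but not the median.
# The daily transaction counts are hypothetical.
import statistics

daily_transactions = [4, 5, 5, 6, 7, 6, 55]  # one unusually busy day

print(statistics.mean(daily_transactions))    # about 12.6, pulled up by the outlier
print(statistics.median(daily_transactions))  # 6, the typical day
```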
Using a parametric statistical test when one or more of that
test’s core assumptions have been violated compromises the validity of the
inferences that can be drawn from the test results and, by extension, the rigor
of the research. In this case, the term ‘inferences’ refers to the conclusions
that may be drawn from a test’s results. It is usually not possible to survey
or test every member in the population of interest (for example, academic
librarians who have advanced into middle management positions within the last
five years), and as such, inferential statistical tests may be used on a much
smaller sample of that population to make inferences (generalizations) about
that larger population. Parametric tests can have greater power to detect
statistically significant differences and effects than their nonparametric
equivalents; in other words, they can be more sensitive to effects and
differences that are smaller in scale. However, using a parametric test when one or more of its assumptions has been violated may result in an inaccurate representation of the data (for example, when the mean is pulled far from the center of the data by a few extreme values), which in turn means the inferences made about the larger population from which the sample was drawn may be flawed or inaccurate. Therefore, nonparametric tests, when called for, increase the
rigor of a study’s conclusions and the extent to which such conclusions are
justified for use in evidence based practice.
A number of research scenarios common to library scholarship
warrant the use of nonparametric statistical methods. Their use may be called
for in order to increase a study’s internal validity (the extent to which the
study is able to investigate the topic of interest), to increase the study’s
statistical rigor, or both. Several of these research scenarios are described
below.
Surveys are a popular research method for librarians, as is
evident from the number of requests for participation that come through email
discussion lists. Many of the sorts of questions that are asked in
librarian-designed surveys would best be analyzed with nonparametric
statistical methods, as our research interests often elicit categorical
or ordinal data. For example, librarians often employ Likert scale questions. These
sorts of questions ask participants to respond on a five-point scale whether
they strongly disagree, disagree, neither disagree nor agree, agree, or
strongly agree with the question stem. Likert-type questions, which use similar
scales but which may have more points or ask about frequency rather than
agreement, are also common. Ideally, in addition to identifying the
construct(s) or variable(s) of interest, survey designers will have also
identified all of the subconstructs making up the construct(s). For example,
the construct of library anxiety might have attitudinal, cognitive, and
behavioral subconstructs. A rigorous survey will have at least three questions
that speak to each subconstruct of library anxiety; the survey designer will
have determined the survey’s construct validity (the extent to which the
subconstructs do or do not represent all aspects of the construct itself); and will have evaluated whether the questions themselves adequately speak to each
subconstruct. To analyze data collected from Likert scales, the response
options are converted to artificial scores, with, for example, a ‘strongly
agree’ converted to a one, an ‘agree’ converted to a two, and so on. Responses
to Likert or Likert-type questions designed in this way can then be combined for each subconstruct, yielding data on an interval scale that may be analyzed with parametric statistical methods.
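As a sketch of this scoring step, the snippet below converts the responses for three hypothetical items belonging to a single subconstruct into scores and sums them for each respondent. The item names, responses, and coding direction (strongly agree scored as one, following the example above) are illustrative assumptions rather than a prescribed scheme.

```python
# Minimal sketch: converting Likert responses to scores and combining the
# items for one subconstruct. Item names and responses are hypothetical.
LIKERT_SCORES = {
    "strongly agree": 1,
    "agree": 2,
    "neither agree nor disagree": 3,
    "disagree": 4,
    "strongly disagree": 5,
}

# Three items assumed to measure the attitudinal subconstruct of library anxiety.
respondents = [
    {"anx_att_1": "agree", "anx_att_2": "strongly agree", "anx_att_3": "agree"},
    {"anx_att_1": "disagree", "anx_att_2": "neither agree nor disagree", "anx_att_3": "disagree"},
]

# Sum the three item scores per respondent to get a subconstruct score.
subconstruct_scores = [
    sum(LIKERT_SCORES[resp[item]] for item in ("anx_att_1", "anx_att_2", "anx_att_3"))
    for resp in respondents
]
print(subconstruct_scores)  # [5, 11]
```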
If, however, there are only one or two questions that speak
to each subconstruct or construct, the data created will be ordinal in scale:
the difference in strength of feeling between one respondent’s “strongly agree”
and “agree” may not be the same as the difference between that respondent’s
“agree” and “neither agree nor disagree.” Further, the differences in strength
of feeling are likely to differ between respondents. For example, a respondent
who only slightly agrees with the question stem, and another who wholeheartedly
agrees but does not consider their agreement ‘strong,’ may both choose a
response of “agree.” And, if there are few survey respondents, there may not be
enough responses to meet the minimum number required for the anticipated
parametric statistical test (for example, 15 per group for a t-test), or the data may not have a
normal (bell-shaped) distribution: responses may be heavily skewed. All of
these scenarios warrant the use of nonparametric tests.
Another common type of question on librarian-designed surveys is the ranking question. For example: “Please rank the following methods of
receiving information from ACRL in the order in which you are most likely to
use them.” “Please rank the usefulness of each of the topics you learned about
in today’s webinar.” “Please rank the following mediums of professional
development in order of their desirability.” Before analyzing the statistical
significance of the data distribution, it is important to first assess whether
respondents agree in their rankings: a given item may appear to be the most
popular, but reviewing the data may reveal that the item was ranked last by a good number of respondents. This requires a nonparametric test that,
essentially, tests inter-rater reliability on a large scale.
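A simple tabulation, run before any significance testing, can surface this kind of disagreement. The sketch below counts, for each item, how many respondents ranked it first and how many ranked it last; the items and rankings are hypothetical.

```python
# Minimal sketch: tallying first- and last-place rankings per item.
# Each inner list is one respondent's ranking, best (1) to worst (4);
# the items and rankings are hypothetical.
from collections import Counter

items = ["email", "blog", "podcast", "social media"]
# rankings[r][i] = the rank that respondent r gave to items[i]
rankings = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [4, 1, 2, 3],
    [4, 2, 1, 3],
]

n_items = len(items)
for i, item in enumerate(items):
    ranks = Counter(r[i] for r in rankings)
    print(f"{item}: ranked first by {ranks[1]}, ranked last by {ranks[n_items]}")
# "email" is ranked first by two respondents but last by the other two,
# so apparent popularity alone is misleading.
```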
Quasi-experimental studies that evaluate the effectiveness of
a program or instructional strategy are also common in the library literature.
On all but the rarest of occasions, however, library studies do not have
sufficient participants to meet the minimum threshold for the parametric tests
librarians commonly use for these research designs, such as a t-test or ANCOVA (at least 15 per group,
or 30 in a single group). Parametric statistical methods are influenced by
outliers and therefore require a minimum number of participants to counteract
the effect of any outliers. Additionally, most if not all parametric methods
assume independence of observations: that each participant has received the
treatment independent of all other participants. Independence of observations
is important both because it ensures participants do not influence each other’s scores and because it mitigates the risk of systematic bias in those scores. Systematic bias could be introduced in many ways: a fire alarm, resulting
in all students in a class missing the same piece of content; a discussion that
takes place in one class section but not another; or seemingly minor
differences in delivery between classes offered by different librarians. In
short, a librarian who wishes to evaluate the effectiveness of a lesson taught
to one class of 20 students has an n of
1, not 20. Unless the librarian is teaching for a course that has many
sections, such as a first-year writing course, or is willing to collect data
over multiple years (which introduces validity threats of its own), it is
likely that the librarian will have a very small n. Data collected from these small or radically non-normal samples
should be analyzed using nonparametric methods such as the Mann-Whitney U test or the sign test (alternatives
for the independent samples t-test
and paired samples t-test,
respectively), which do not rely upon the assumption of a normal distribution.
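The sketch below shows what both tests look like in practice with scipy, using invented quiz scores: the Mann-Whitney U test compares two independent class sections, and the sign test for paired pre/post scores is built from a binomial test on the direction of each student’s change (ties are excluded, as is standard for the sign test).

```python
# Minimal sketch: nonparametric alternatives to the t-test on hypothetical
# quiz scores. Requires scipy.
from scipy import stats

# Two independent class sections (alternative to the independent samples t-test).
section_a = [3, 5, 4, 6, 2, 5, 4]
section_b = [6, 7, 5, 8, 6, 7, 9]
u_stat, u_p = stats.mannwhitneyu(section_a, section_b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat}, p = {u_p:.3f}")

# Paired pre/post scores for one group (alternative to the paired samples t-test):
# a sign test, implemented as a binomial test on the direction of each change.
pre = [2, 3, 4, 3, 5, 2, 4, 3]
post = [4, 5, 4, 6, 6, 3, 5, 5]
increases = sum(b > a for a, b in zip(pre, post))
decreases = sum(b < a for a, b in zip(pre, post))  # ties are dropped
sign_result = stats.binomtest(increases, increases + decreases, p=0.5)
print(f"Sign test: {increases} increases, {decreases} decreases, "
      f"p = {sign_result.pvalue:.3f}")
```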
Further, nonparametric tests may also increase librarians’
understanding of the practical significance of their research. Statistical
significance, or the likelihood of the findings not being due to chance, can be manipulated by increasing sample
size; with a sufficiently large sample, most measured relationships/differences
will be found to be statistically significant. A nonparametric test such as
Kendall’s W evaluates agreement among
a large number of raters, is not affected by sample size, and will, for
example, allow the researcher to determine the extent to which survey-takers
agree on the order of ten ranked items.
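As an illustration of that calculation, the short function below computes Kendall’s W from its standard formula (without a correction for tied ranks) on a small set of hypothetical rankings; W ranges from 0 (no agreement among raters) to 1 (complete agreement).

```python
# Minimal sketch: Kendall's coefficient of concordance (W) for m raters
# ranking n items, using the standard formula without a ties correction.
import numpy as np

def kendalls_w(rankings):
    """rankings: array of shape (m raters, n items) holding ranks 1..n."""
    r = np.asarray(rankings, dtype=float)
    m, n = r.shape
    rank_sums = r.sum(axis=0)              # R_i, the rank total for each item
    mean_rank_sum = m * (n + 1) / 2.0      # expected total under no preference
    s = ((rank_sums - mean_rank_sum) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical rankings of four items by four survey respondents.
rankings = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
    [1, 2, 4, 3],
]
print(f"Kendall's W = {kendalls_w(rankings):.2f}")  # values near 1 mean strong agreement
```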
One explanation for nonparametric tests’ relative obscurity
may be their comparatively lower power (ability to detect small differences/associations between groups) relative to their parametric counterparts. However, data that do not meet the assumptions undergirding parametric tests can in some cases be more powerfully analyzed with nonparametric tests. More statistical power means not just a greater ability to detect differences or associations between groups, but also stronger statistical significance for those differences/associations, that is, a smaller, all-important p-value.
These are just a few of the reasons that nonparametric
statistical methods are more appropriate than parametric tests for many of the
research designs favored by librarians. When used appropriately, nonparametric
statistical methods can result in research findings of greater statistical
validity and explanatory power. The subscription-based (but inexpensive)
website Laerd Statistics is recommended as a resource for librarians wishing to
identify nonparametric alternatives to specific parametric tests or learn more
about nonparametric methods.