The Journal of Experimental Education, 2002, 71(1), 83–92
Problems With Null Hypothesis Significance Testing (NHST): What Do the Textbooks Say?
JEFFREY A. GLINER
NANCY L. LEECH
GEORGE A. MORGAN
Colorado State University
ABSTRACT. The first of 3 objectives in this study was to address the major problem with Null Hypothesis Significance Testing (NHST) and 2 common misconceptions related to NHST that cause confusion for students and researchers. The misconceptions are (a) a smaller p indicates a stronger relationship and (b) statistical significance indicates practical importance. The second objective was to determine how this problem and the misconceptions were treated in 12 recent textbooks used in education research methods and statistics classes. The third objective was to examine how the textbooks’ presentations relate to current best practices and how much help they provide for students. The results show that almost all of the textbooks fail to acknowledge that there is controversy surrounding NHST. Most of the textbooks dealt, at least minimally, with the alleged misconceptions of interest, but they provided relatively little help for students.
Key words: effect size, NHST, practical importance, research and statistics textbooks
THERE HAS BEEN AN INCREASE in resistance to null hypothesis significance testing (NHST) in the social sciences during recent years. The intensity of these objections to NHST has increased, especially within the disciplines of psychology (Cohen, 1990, 1994; Schmidt, 1996) and education (Robinson & Levin, 1997; Thompson, 1996). In response to a recent survey of American Educational Research Association (AERA) members’ perceptions of statistical significance tests and other statistical issues published in Educational Researcher, Mittag and Thompson (2000) concluded that “Further movement of the field as regards the use of statistical tests may require elaboration of more informed editorial policies” (p. 19).
Address correspondence to Jeffrey A. Gliner, 206 Occupational Therapy Building, Colorado State University, Fort Collins, CO 80523-1573. E-mail: Gliner@cahs.colostate.edu
The American Psychological Association (APA) Task Force on Statistical Inference (Wilkinson & the APA Task Force on Statistical Inference, 1999) initially considered suggesting a ban on the use of NHST, but decided not to, stating instead, “Always provide some effect size estimate when reporting a p value” (p. 599). The new APA (2001) publication manual states, “The general principle to be followed . . . is to provide the reader not only with information about statistical significance but also with enough information to assess the magnitude of the observed effect or relationship” (p. 26).
Although informed editorial policies are one key method to increase awareness of changes in data analysis practices, another important practice concerns the education of students through the texts that are used in research methods and statistics classes. Such texts are the focus of this article.
We have three objectives in this article. First, we address the major problem involved with NHST and two common misconceptions related to NHST that cause confusion for students and researchers (Cohen, 1994; Kirk, 1996; Nickerson, 2000). These two misconceptions are (a) that the size of the p value indicates the strength of the relationship and (b) that statistical significance implies theoretical or practical significance. Second, we determine how this problem and these two misconceptions are treated in textbooks used in education research methods and statistics classes. Finally, we examine how these textbook presentations relate to current best practices and how much help they provide for students.
Kirk (1996) had major criticisms of NHST. According to Kirk, the procedure does not tell researchers what they want to know:
In scientific inference, what we want to know is the probability that the null hypothesis (H0) is true given that we have obtained a set of data (D); that is, p(H0|D). What null hypothesis significance testing tells us is the probability of obtaining these data or more extreme data if the null hypothesis is true, p(D|H0). (p. 747)
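The gap between p(D|H0) and p(H0|D) can be made concrete with Bayes' rule. The sketch below is only an illustration under stated assumptions: it treats "D" as the coarse event "the test came out significant," and the function name, the prior probabilities, and the power value are hypothetical choices, not figures from Kirk or from the textbooks reviewed here.

```python
def prob_null_given_significant(alpha, power, prior_null):
    """P(H0 | significant result) from Bayes' rule.

    alpha      -- P(significant | H0), the Type I error rate
    power      -- P(significant | H1), assumed known for illustration
    prior_null -- P(H0) before seeing the data (a hypothetical input)
    """
    p_sig_and_null = alpha * prior_null
    p_sig_and_alt = power * (1.0 - prior_null)
    return p_sig_and_null / (p_sig_and_null + p_sig_and_alt)


# The same alpha (.05) is compatible with very different posterior
# probabilities of H0, depending on the assumed prior.
for prior in (0.1, 0.5, 0.9):
    posterior = prob_null_given_significant(alpha=0.05, power=0.80, prior_null=prior)
    print(f"prior P(H0) = {prior:.1f} -> P(H0 | significant) = {posterior:.3f}")
```

Under these assumed inputs, the probability that the null hypothesis is true after a significant result varies widely with the prior, which is why a p value alone cannot be read as p(H0|D).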
Kirk (1996) went on to explain that NHST was a trivial exercise because the null hypothesis is always false, and rejecting it is merely a matter of having enough power. In this study, we investigated how textbooks treated this major problem of NHST.
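Kirk's point that rejection is "merely a matter of having enough power" can be sketched numerically. The example below assumes a two-group comparison with equal group sizes and uses a normal (z) approximation rather than the t distribution; the function name and the effect size of d = 0.05 are hypothetical choices made only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z test.

    d is the standardized mean difference; the normal approximation
    is an assumption that is reasonable only for large samples.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# A trivially small true effect is rejected almost surely once n is large enough.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n per group = {n:>7}: power = {approx_power(0.05, n):.3f}")
```

With a nonzero true effect, however small, the approximate power climbs toward 1 as the sample size grows, so a rejection by itself says little about whether the effect matters.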
Current best practice in this area is open to debate (e.g., see Harlow, Mulaik, & Steiger, 1997). A number of prominent researchers advocate the use of confidence intervals in place of NHST on grounds that, for the most part, confidence intervals provide more information than a significance test and still include information necessary to determine statistical significance (Cohen, 1994; Kirk, 1996). For those who advocate the use of NHST, the null hypothesis of no difference (nil hypothesis) should be replaced by a null hypothesis specifying some nonzero value based on previous research (Cohen, 1994; Mulaik, Raju, & Harshman, 1997). Thus, there would be less chance that a trivial difference between intervention and control groups would result in a rejection of the null hypothesis.
The Size of the p Value Indicates the Strength of the Treatment
Outcomes with lower p values are sometimes interpreted by students as having stronger treatment effects than those with higher p values; for example, an outcome of p < .01 is interpreted as having a stronger treatment effect than an outcome of p < .05. The p value indicates the probability of obtaining the observed outcome, or one more extreme, assuming a true null hypothesis. It does not indicate the strength of the relationship. However, although p values do not provide information about the size or strength of the effect, smaller p values, given a constant sample size, are correlated with larger effect sizes. This fact may contribute to the misconception that this article is designed to clarify.
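A short numerical sketch may help show why the p value cannot stand in for effect size. It assumes two equal-sized groups and a z test with known variance (a simplification of the usual t test); the function name and the three designs are hypothetical and chosen so that the arithmetic is easy to follow.

```python
from math import erfc, sqrt


def two_sided_p(d, n_per_group):
    """Two-sided p value for a standardized mean difference d,
    using a z test with equal group sizes (an assumed simplification)."""
    z = d * sqrt(n_per_group / 2)
    return erfc(abs(z) / sqrt(2))


# Three hypothetical studies: very different effect sizes, same p value,
# because the smaller effects were observed in larger samples.
for d, n in [(1.0, 8), (0.5, 32), (0.2, 200)]:
    print(f"d = {d:.1f}, n per group = {n:>3}: p = {two_sided_p(d, n):.4f}")

# With the sample size held constant, the larger effect does give the smaller p.
print(two_sided_p(0.5, 32) < two_sided_p(0.2, 32))
```

All three designs produce the same test statistic and therefore the same p value, even though the observed effects range from small to large; only when sample size is fixed does a smaller p correspond to a larger effect.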
How prevalent is this misinterpretation? Oakes (1986) suggested,
It is difficult, however, to estimate the extent of this abuse because the identification of statistical significance with substantive significance is usually implicit rather than explicit. Furthermore, even when an author makes no claim as to an effect size underlying a significant statistic, the reader can hardly avoid making an implicit judgment as to that effect size. (p. 86)
Oakes found that researchers in psychology grossly overestimate the size of the effect based on a significance level change from .05 to .01. On the other hand, in the AERA survey provided by Mittag and Thompson (2000), respondents strongly disagreed with the statement that p values directly measure study effect size. One explanation for the difference between the two studies is that the Mittag and Thompson (2000) survey question asked for a weighting of agreement with a statement on a 1–5 scale, whereas Oakes embedded his question in a more complex problem.
The current best practice is to report the effect size (i.e., the strength of the relationship between the independent variable and the dependent variable). However, Robinson and Levin (1997) and Levin and Robinson (2000) brought up two issues related to the reporting of effect size. First, is it most appropriate to use effect sizes, confidence intervals, or both? We agree with Kirk (1996), who suggested that when the measurement is in meaningful units, a confidence interval should be used. However, when the measurement is in unfamiliar units, effect sizes should be reported. Currently there is a move to construct confidence intervals around effect sizes (Steiger & Fouladi, 1997; Special Section of Educational and Psychological Measurement, 61(4), 2001). Computing these confidence intervals involves use of a noncentral distribution that can be addressed with proper statistical software (see Cumming & Finch, 2001).
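As a rough illustration of the two reporting options, the sketch below computes a confidence interval for a mean difference in raw units and a standardized effect size (Cohen's d) for the same data. The scores are hypothetical, the function names are illustrative, and the interval uses a large-sample normal critical value rather than the t or noncentral t distributions discussed above, so it should be read as a simplified sketch rather than the procedure described by Cumming and Finch (2001).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_difference_ci(group_a, group_b, level=0.95):
    """Difference in means with an approximate (normal-based) confidence interval."""
    diff = mean(group_a) - mean(group_b)
    se = sqrt(stdev(group_a) ** 2 / len(group_a) + stdev(group_b) ** 2 / len(group_b))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2 + (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical test scores (meaningful units: points on a 100-point exam).
treatment = [78, 85, 90, 72, 88, 81, 79, 94]
control = [70, 75, 80, 68, 77, 74, 71, 82]

diff, low, high = mean_difference_ci(treatment, control)
print(f"mean difference = {diff:.2f} points, approx. 95% CI [{low:.2f}, {high:.2f}]")
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```

When the scale is familiar, as with exam points, the interval in raw units is directly interpretable; the standardized d is the fallback when the units carry no intuitive meaning.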
Should effect size information accompany only statistically significant outcomes? This is the second issue introduced by Robinson and Levin (1997) and Levin and Robinson (2000). The APA Task Force on Statistical Inference (Wilkinson et al., 1999) recommended always presenting effect sizes for primary outcomes. The Task Force further stated that “reporting effect sizes also informs power analyses and meta-analyses needed in future research” (p. 599). On the other hand, Levin and Robinson (2000) were adamant about not presenting effect sizes after nonsignificant outcomes. They noted a number of instances of single-study investigations in which educational researchers have interpreted effect sizes in the absence of statistically significant outcomes. Our opinion is that effect sizes should accompany all reported p values for possible future meta-analytic use, but they should not be presented as findings in a single study in the absence of statistical significance.
Statistical Significance Implies Theoretical or Practical Significance
A common misuse of NHST is the implication that statistical significance means theoretical or practical significance. This misconception involves interpreting a statistically significant difference as a difference that has practical or clinical implications. Although there is nothing in the definition of statistical significance indicating that a significant finding is practically important, such a finding may be of sufficient magnitude to be judged to have practical significance.
Some recommendations to facilitate the proper interpretation of practical importance include Thompson’s (1996) suggestion that the term “significant” be replaced by the phrase “statistically significant” to describe results that reject the null hypothesis and to distinguish them from practical significance or importance. Respondents to the AERA members survey (Mittag & Thompson, 2000) strongly agreed with this suggestion.
Kirk (1996) suggested reporting confidence intervals about a mean for familiar measures and reporting effect sizes for unfamiliar measures. However, as more researchers advocate the reporting of effect sizes to accompany statistically significant outcomes, we caution that effect size is not necessarily synonymous with practical significance. For example, a treatment could have a large effect size according to Cohen’s (1988) guidelines and yet have little practical importance (e.g., because of the cost of implementation). On the other hand, Rosnow and Rosenthal (1996) studied aspirin’s effect on heart attacks. They demonstrated that those who took aspirin had a statistically significant lower probability of having a heart attack than those in the placebo condition, but the effect size was only φ = .034. One might argue that phi is not the best measure of effect size here because when the split on one dichotomous variable is extreme compared with the other dichotomous variable, the size of phi is constricted (Lipsey & Wilson, 2001). However, the odds ratio from these data was only 1.8, which is not considered strong (Kraemer, 1992). The point here is that one can have a small effect size that is practically important, and vice versa. Although this effect size is considered to be small, the practical importance was high, because of both the low cost of taking aspirin and the importance of reducing myocardial infarction. Cohen emphasized that context matters and that his guidelines (e.g., d = 0.8 is large) were arbitrary. Thus, what is a large effect in one context or study may be small in another.
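The two effect size indexes mentioned for the aspirin example can be computed from a 2 × 2 table as sketched below. The cell counts are an assumption: they are the counts commonly reported for the physicians' aspirin trial, and they are not given in this article. The function names are likewise illustrative.

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table laid out as [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def odds_ratio(a, b, c, d):
    """Odds of the outcome in row 1 relative to row 2."""
    return (a * d) / (b * c)


# Assumed counts (not taken from this article):
# rows = placebo, aspirin; columns = heart attack, no heart attack.
placebo_mi, placebo_no_mi = 189, 10845
aspirin_mi, aspirin_no_mi = 104, 10933

phi = phi_coefficient(placebo_mi, placebo_no_mi, aspirin_mi, aspirin_no_mi)
ratio = odds_ratio(placebo_mi, placebo_no_mi, aspirin_mi, aspirin_no_mi)
print(f"phi = {phi:.3f}, odds ratio = {ratio:.2f}")
```

With these assumed counts the results round to the values quoted in the text (φ of about .034 and an odds ratio of about 1.8), illustrating how a rare outcome keeps phi small even when the relative odds nearly double.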
Perhaps the biggest problem associated with the practical significance issue is the lack of good measures. Cohen (1994) pointed out that researchers probably were not reporting confidence intervals because they were so large. He went on to say, “their sheer size should move us toward improving our measurement by seeking to reduce the unreliable and invalid part of the variance in our measures” (p. 1002).
Six textbooks used in graduate-level research classes in education and six textbooks used in graduate-level statistics classes in education were selected for this study. We tried to select a diverse set of popular, commonly used textbooks, almost all of which were in at least the second edition. We consulted with colleagues at a range of universities (from comprehensive research universities to those specializing in teacher training to smaller, private institutions) about the textbooks they used, and we included these books in our sample. The statistics textbooks either referred to education in the title or were written by an author in a school of education; they covered at a minimum through analysis of variance (ANOVA) and multiple regression. The textbooks used for this study are listed in the references and are identified with one asterisk for research books and two asterisks for statistics books.
We made judgments about the textbooks for each of the issues and examined all the relevant passages for each topic. Each author independently rated two thirds of the textbooks, yielding two raters per textbook. Table 1 shows the rating system with criteria for points and an example of how the criteria were used for one of the issues.
Table 2 shows the interrater reliability among the three judges. Although exact agreement within an issue was quite variable, from a high of 100% to a low of 42%, there was much less variability (from 92% to 100% agreement) among raters for close agreement (i.e., ± 1 point). The strongest agreement was for issue 3, which posits that statistical significance does not mean practical importance, on which there was 100% agreement for all texts. This issue also had the highest average rating among the three (see Table 3), indicating that the issue was typically presented under a separate heading so that it was easy for the raters to find and evaluate. If the raters disagreed, they met and came to a consensus score, which was used for Table 3.
TABLE 1
The Criteria and Examples of How the Ratings Were Used
Example (from statistical vs. practical performance)
No mention of the issue of practical vs. statistical significance.
This book discussed briefly whether a difference was real. No specific information included in the index or text about practical importance.
about this issue, no headings contrasting statistical and practical significance, and nothing in the index about this issue. Thus, the relatively isolated statements could easily be missed.
Although these books had several statements such as “results can be statistically significant without being important,” they were usually an isolated point in a broader section (e.g., on the level of significance). The examples (e.g., about sample size and correlation) did not provide much help in terms of deciding what is practically important.
In these books, there was discussion of the issue in the sections on several or all of the major statistics. There were headings such as “statistical versus practical significance.” In addition to repeated statements that not all statistically significant results have practical importance, there were helpful examples.
TABLE 2
Interrater Reliability
p does not indicate the strength of the relationship
Statistical significance does not mean practical importance
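The two agreement indexes reported for the raters, exact agreement and close agreement (within ± 1 point), can be computed as in the sketch below. The ratings shown are hypothetical values for 12 texts, included only to make the calculation concrete; they are not the ratings from this study, and the function name is illustrative.

```python
def agreement_rates(ratings_a, ratings_b, tolerance=1):
    """Return (exact, close) agreement proportions for two raters.

    'Close' counts pairs whose ratings differ by at most `tolerance` points.
    """
    pairs = list(zip(ratings_a, ratings_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    close = sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)
    return exact, close


# Hypothetical ratings for 12 textbooks on one issue.
rater_1 = [0, 1, 2, 2, 3, 4, 1, 0, 2, 3, 4, 1]
rater_2 = [0, 2, 2, 1, 3, 4, 1, 1, 2, 3, 2, 1]

exact, close = agreement_rates(rater_1, rater_2)
print(f"exact agreement = {exact:.0%}, close agreement = {close:.0%}")
```

Because close agreement counts every exact match as well as every one-point difference, it can never fall below exact agreement, which is consistent with the narrower range reported for it.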
Table 3 shows the percentage of books that covered each of the topics at a rating of at least 2 (direct but brief statement) and the average rating for each of the issues.
TABLE 3 Percentage of Texts (Rating 2 or More) and Average Rating for Each of the Three Issues
The issue of null hypothesis testing had the lowest average overall rating (0.58) and was covered at a level of 2 or more in only 17% (2 out of 12) of the texts. It was rarely addressed in any of the research texts, and then only indirectly. At least one text, however, made an effort to address NHST, as in the following example:
Fisher opposed the idea of an alternative hypothesis. This was the creation of . . . (Neyman & Pearson) . . . whose views Fisher vehemently opposed . . . nevertheless, it became standard practice that when rejecting the null hypothesis, one accepts the alternative. It must be emphasized, however, that a p-value does not give the probability of either the null or the alternative hypothesis as being true. (Minium, King, & Bear, 1993, p. 294)
This quote was found in a large box titled “Point of Controversy—Dichotomous Hypothesis-Testing Decisions.”
The Size of the p Value Indicates the Strength of the Treatment
This issue also had a low average overall rating (1.58) but was stated directly (i.e., covered at a level of 2 or more) in two thirds of the texts. There appeared to be no difference in the percentage of texts or depth of coverage between research and statistics texts. When covered at a level of 3 or 4, a typical excerpt is as follows: “Recommendation 14: Do not use tests of statistical significance to evaluate the magnitude of a relationship” (Fraenkel & Wallen, 2000, p. 273). The recommendations are indented and in italics, making them obvious and easy for the student to see and to recognize as important. Some of the texts mentioned the use of effect size measures to indicate the strength of the relationship, but many did not.
Statistical Significance Versus Practical Significance
The issue of practical significance had the most coverage (84% of textbooks had direct statements) and the highest average overall rating (3.25). Most often, this issue was covered with more than a brief statement, with emphasis, and was frequently presented under a separate heading. Typical statements for this issue were “Just because a result is statistically significant (not due to chance) does not mean that it has any practical or educational value in the real world in which we all work and live” (Fraenkel & Wallen, 2000, p. 254). Or, “Remember that a ‘reject’ decision, by itself, is not indicative of a useful finding” (Huck, 2000, p. 199). However, we judged the discussions that followed these and similar statements to be less helpful on the basis of an examination of the examples and the context in which the statement was included. It was clear that statistical significance is not the same as practical significance or importance, but it was usually less clear how to know whether a result has practical importance.
Discussion
Our interpretation is similar to that of Mittag and Thompson (2000), who noted “Our results contain some heartening and disheartening findings” (p. 19). Most disheartening was the failure of almost all of these recent texts to acknowledge that there is controversy surrounding NHST. Although many of the texts provided detailed information on confidence intervals and effect sizes, few related this information to hypothesis testing. This was especially true for effect sizes; many textbooks discussed effect sizes only in the contexts of computing power or of meta-analysis. On the positive side, most of the texts dealt, at least minimally, with the two misconceptions of interest for our article. A few of these textbooks went into detail and provided recommendations similar to those suggested by Kirk (1996).
Why was there a discrepancy between the many articles acknowledging problems with NHST and the failure of these research and statistics textbooks to recognize these problems? We suggest three explanations for this apparent discrepancy. The first concerns textbook revisions. Most of these texts, especially the research texts, were in their third to sixth edition, and they had originally been published before 1990 when NHST was less controversial. Our speculation is that the authors of research textbooks in which we found little or no statement of the NHST controversy focused their revisions on updating the content literature about the studies and methods they cited. They may not have updated the statistics chapters much, perhaps assuming that statistics do not change. In addition, adding information about how to compute and report effect size and confidence intervals would change too many sections of the text. Authors have to be practical, considering publishing company deadlines and competition from other newer texts. Thus, textbook revisions in this area have generally been limited.
A second possible explanation for the failure to include NHST issues in textbooks concerns the level of depth, difficulty of concepts, and students’ prior knowledge. The textbooks that we reviewed were intended for master’s- or doctoral-level students in education, often as a first course in research or statistics or one taken many years after having an earlier such course. The logic of hypothesis testing is relatively difficult to understand, especially if students are not familiar with research design and statistics.
Our third possible explanation relates to best practice. Although there is general acknowledgement that each of the topics we explored should be covered in research and statistics textbooks, there is not general agreement about how they should be covered. This is especially the case with regard to how to decide whether a statistically significant finding has practical importance. There is also controversy about best practice for hypothesis significance testing (Harlow, Mulaik, & Steiger, 1997) and effect size reporting (Levin & Robinson, 2000; Robinson & Levin, 1997). Textbook authors (and publishers) are often reluctant to put best practices in print that may be changed in the next few years.
We have three recommendations to those who are in the process of writing or revising a research or statistics text. First, consider introducing a section on the NHST controversy. This section would, at the minimum, point out that there is currently debate about whether NHST is the best method for advancing research in the fields of education and the social sciences. Second, because the fifth edition of the APA (2001) publication manual includes information on the importance of reporting effect sizes and confidence intervals, an author should provide specific examples (as might be published in a journal) of what to do following a statistically significant outcome. We also recommend reporting effect size for nonsignificant outcomes. Third, provide more help (with examples) for deciding whether a result has practical significance or importance.
REFERENCES
American Psychological Association (APA). (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.
*Ary, D., Jacobs, L. C., & Razavieh, A. (1996). Introduction to research in education (5th ed.). Fort Worth, TX: Holt, Rinehart and Winston.
**Cobb, G. W. (1998). Design and analysis of experiments. New York: Springer-Verlag.
Cohen, J. (1988). Power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Cohen, J. (1994). The world is round (p < .05). American Psychologist, 49, 997–1003.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532–574.
*Fraenkel, J. R., & Wallen, N. E. (2000). How to design and evaluate research in education (4th ed.).
*Gall, J. P., Gall, M. D., & Borg, W. (1999). Applying educational research: A practical guide (4th ed.). New York: Addison Wesley Longman.
*Gay, L. R., & Airasian, P. (2000). Educational research: Competencies for analysis and application
**Glass, G. V, & Hopkins, K. D. (1996). Statistical methods in education and psychology. Boston:
Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). (1997). What if there were no significance tests?
**Hinkle, D. E., Wiersma, W., & Jurs, S. G. (1998). Applied statistics for the behavioral sciences (4th
**Huck, S. W. (2000). Reading statistics and research (3rd ed.). New York: Addison Wesley Long-
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746–759.
Kraemer, H. C. (1992). Evaluating medical tests. Newbury Park, CA: Sage.
*Krathwohl, D. R. (1998). Educational and social science research: An integrated approach (2nd ed.). New York: Addison Wesley Longman.
Levin, J. R., & Robinson, D. H. (2000). Rejoinder: Statistical hypothesis testing, effect-size estimation, and the conclusion coherence of primary research studies. Educational Researcher, 29, 34–36.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
*McMillan, J. H., & Schumacher, S. (1997). Research in education: A conceptual introduction (4th ed.). New York: Addison Wesley Longman.
**Minium, E. W., King, B. M., & Bear, G. (1993). Statistical reasoning in psychology and education
Mittag, K. C., & Thompson, B. (2000). A national survey of AERA members’ perceptions of statistical significance tests and other statistical issues. Educational Researcher, 29, 14–20.
Mulaik, S. A., Raju, N. S., & Harshman, R. A. (1997). There is a time and place for significance testing. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.). What if there were no significance tests? (pp. 65–116). Mahwah, NJ: Erlbaum.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley.
Robinson, D. H., & Levin, J. R. (1997). Reflections on statistical and substantive significance, with a slice of replication. Educational Researcher, 26, 21–26.
Rosnow, R. L., & Rosenthal, R. (1996). Beginning behavioral research (2nd ed.). Englewood Cliffs,
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115–129.
**Shavelson, R. J. (1996). Statistical reasoning for the behavioral sciences (3rd ed.). Needham
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.). What if there were no significance tests? (pp. 221–258). Mahwah, NJ: Erlbaum.
Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25, 26–30.
Wilkinson, L., & the APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
*Textbooks used in educational research classes.
**Textbooks used in educational statistics classes.