Calculating Confidence Intervals for the
Number Needed to Treat
Ralf Bender, PhD
Department of Epidemiology and Medical Statistics, School of Public Health, University of
Bielefeld, Bielefeld, Germany
The number needed to treat (NNT) has gained much attention in the past years as a useful way of reporting the results of randomized controlled trials with a binary outcome. Defined as the reciprocal of the absolute risk reduction (ARR), NNT is the estimated average number of patients needed to be treated to prevent an adverse outcome in one additional patient. As with other estimated effect measures, it is important to document the uncertainty of the estimation by means of an appropriate confidence interval. Confidence intervals for NNT can be obtained by inverting and exchanging the confidence limits for the ARR provided that the NNT scale ranging from 1 through ∞ to −1 is taken into account. Unfortunately, the only method used in practice to calculate confidence intervals for ARR seems to be the simple Wald method, which yields too short confidence intervals in many cases. In this paper it is shown that the application of the Wilson score method improves the calculation and presentation of confidence intervals for the number needed to treat.
Control Clin Trials 2001;22:102–110 Elsevier Science Inc. 2001
KEY WORDS: Absolute risk reduction, confidence interval, evidence-based medicine, equivalence, number
The number needed to treat (NNT) has gained much attention in the past years as a useful way of reporting the results of randomized controlled trials with a binary outcome [1–3]. Defined as the reciprocal of the absolute risk reduction (ARR), the number needed to treat is the estimated average number of patients needed to be treated to prevent an adverse outcome in one additional patient. A negative NNT is the estimated average number of patients needed to be treated with the new rather than the standard treatment for one additional patient to be harmed. While this measure is often better understood than risk ratios or risk reductions by clinicians and patients, the NNT has undesirable mathematical and statistical properties. The understanding of the confidence interval for NNT is not straightforward. However, an excellent explanation was recently given by Altman [4]. The mathematical and statistical properties of the NNT statistic are described in more detail by Lesaffre and Pledger [5].
Address reprint requests to: Ralf Bender, PhD, Department of Epidemiology and Medical Statistics, School of Public Health, University of Bielefeld, P.O. Box 100131, D-33501 Bielefeld, Germany (firstname.lastname@example.org)
Received June 27, 2000; accepted November 28, 2000.
Controlled Clinical Trials 22:102–110 (2001)
The key to understanding the confidence interval for NNT is that principally the domain of NNT is the union of 1 to ∞ and −∞ to −1. The best value of NNT indicating the largest possible beneficial treatment effect is 1, the NNT value indicating no treatment effect (ARR = 0) is ±∞, and the worst NNT value indicating the largest possible harmful effect is −1. Thus, the result NNT = 10 with confidence limits 4 and −20 means that the two regions 4 to ∞ and −20 to −∞ form the confidence interval. Altman proposed to use two new abbreviations, namely number needed to treat for one patient to benefit (NNTB) or be harmed (NNTH) [4]. This concept avoids the awkward term “number needed to harm” (NNH), which is used, for example, in the journal Evidence-Based Medicine. The result of an estimated NNT with confidence interval can then be presented as NNTB = 10 (NNTB 4 to ∞ to NNTH 20) [4].
Altman recommended that a confidence interval should always be given when an NNT is reported as a study result [4]. However, the usual Wald method for calculating such confidence intervals is frequently inappropriate. By using examples from the literature and artificial examples, it is shown that the application of the Wilson score method improves the calculation and presentation of confidence intervals for the number needed to treat.
METHODS TO CALCULATE CONFIDENCE INTERVALS FOR NNT
Let $\pi_1$ and $\pi_2$ be the true probabilities (risks) of an adverse event in the control group (group 1) and the treatment group (group 2), respectively. The true ARR is the difference of the two risks, $\pi_1 - \pi_2$, and the true NNT is the reciprocal $1/(\pi_1 - \pi_2)$ of the true ARR. To estimate these measures a randomized clinical trial can be performed. Let $n_1$ and $n_2$ be the number of patients randomized in the control group and the treatment group, respectively, and let $e_1$ and $e_2$ be the number of patients having an event in the control group and the treatment group, respectively. The two risks can then be estimated by the proportions $p_1 = e_1/n_1$ and $p_2 = e_2/n_2$. The true effect measures can be estimated by $\mathrm{ARR} = p_1 - p_2$ and $\mathrm{NNT} = 1/(p_1 - p_2)$.
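As a minimal sketch of the point estimates defined in this paragraph, the following Python function computes the estimated risks, ARR, and NNT from event counts. The function name `estimate_arr_nnt` and the counts in the usage lines are hypothetical illustrations and are not taken from any trial discussed in this paper; returning +∞ when the estimated ARR is exactly 0 is an assumption made for convenience.

```python
import math


def estimate_arr_nnt(e1, n1, e2, n2):
    """Point estimates for a two-arm trial with a binary outcome.

    e1, n1: events and patients in the control group (group 1)
    e2, n2: events and patients in the treatment group (group 2)
    Returns (p1, p2, arr, nnt); nnt is +inf when the estimated ARR is 0
    (an assumption of this sketch, since no treatment effect is +/- inf).
    """
    p1 = e1 / n1
    p2 = e2 / n2
    arr = p1 - p2
    nnt = math.inf if arr == 0 else 1.0 / arr
    return p1, p2, arr, nnt


# Hypothetical counts: 30/100 events under control, 20/100 under treatment.
p1, p2, arr, nnt = estimate_arr_nnt(30, 100, 20, 100)
print(f"p1={p1:.2f}, p2={p2:.2f}, ARR={arr:.2f}, NNT={nnt:.1f}")
```

A positive result corresponds to an NNTB and a negative result to an NNTH.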
Under regularity conditions (continuity, one-to-one transformation) a confidence interval for NNT can be obtained by inverting and exchanging the confidence limits for ARR. Let LL(ARR) and UL(ARR) be the lower and upper confidence limits for ARR, then the confidence interval for NNT can be expressed as [1/UL(ARR), 1/LL(ARR)]. However, it should be recognized that the continuity condition is violated for the reciprocal transformation if the confidence interval for ARR encloses 0. In this case, the confidence interval for NNT is the union (−∞, 1/LL(ARR)] ∪ [1/UL(ARR), ∞) of two half intervals [4,5]. One possibility to take the violation of the continuity condition into account is Altman’s suggestion to write the confidence interval for an estimated positive NNT value as “NNTB 1/UL(ARR) to ∞ to NNTH 1/LL(ARR)” [4]. Thus, confidence limits for NNT can be calculated from confidence limits for ARR in all cases. Hence, we concentrate on the interval estimation of ARR.
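The inverting and exchanging of limits described above can be sketched in Python as follows. The function names `nnt_ci_from_arr_ci` and `format_nnt_ci` are hypothetical, as is the handling of a limit that is exactly 0 (mapped to ±∞); the usage lines reuse the limits 4 and −20 mentioned earlier, which correspond to ARR limits of −0.05 and 0.25.

```python
import math


def nnt_ci_from_arr_ci(ll_arr, ul_arr):
    """Map confidence limits for ARR to a list of NNT intervals.

    Returns one (lower, upper) tuple when the ARR interval excludes 0,
    and two half intervals when it encloses 0.
    """
    if ll_arr > ul_arr:
        raise ValueError("ll_arr must not exceed ul_arr")
    if ll_arr > 0 or ul_arr < 0:
        # The reciprocal is continuous here: invert and exchange the limits.
        return [(1.0 / ul_arr, 1.0 / ll_arr)]
    # The ARR interval encloses 0: union (-inf, 1/LL] and [1/UL, +inf).
    left_end = -math.inf if ll_arr == 0 else 1.0 / ll_arr
    right_start = math.inf if ul_arr == 0 else 1.0 / ul_arr
    return [(-math.inf, left_end), (right_start, math.inf)]


def format_nnt_ci(ll_arr, ul_arr):
    """Render the interval in NNTB/NNTH notation (absolute values)."""
    if ll_arr > 0:
        return f"NNTB {1 / ul_arr:.1f} to {1 / ll_arr:.1f}"
    if ul_arr < 0:
        return f"NNTH {-1 / ll_arr:.1f} to {-1 / ul_arr:.1f}"
    benefit = "∞" if ul_arr == 0 else f"{1 / ul_arr:.1f}"
    harm = "∞" if ll_arr == 0 else f"{-1 / ll_arr:.1f}"
    return f"NNTB {benefit} to ∞ to NNTH {harm}"


# ARR limits -0.05 and 0.25 correspond to NNT limits -20 and 4.
print(nnt_ci_from_arr_ci(-0.05, 0.25))
print(format_nnt_ci(-0.05, 0.25))
```

Returning a list of intervals keeps the two half intervals explicit instead of hiding the discontinuity at ARR = 0 behind a single pair of numbers.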
The standard method of calculating confidence intervals for ARR makes use of the asymptotic normality and the usual formula for the standard error (SE) of the estimated ARR. Using the notations above the estimated standard error of $p_1 - p_2$ is given by:

$$\mathrm{SE}(p_1 - p_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$

Let $z_{1-\alpha/2}$ be the 1 − α/2 quantile of the standard normal distribution. The simple Wald-type 100 × (1 − α)% confidence interval for ARR is then given by:

$$p_1 - p_2 \pm z_{1-\alpha/2}\,\mathrm{SE}(p_1 - p_2)$$
While Wald confidence intervals are adequate for large sample sizes and probabilities not close to 0 or 1, they have poor coverage characteristics and a propensity to aberrations in many practical situations. Especially in small samples, unbalanced designs, and probabilities close to 0 or 1, the Wald method leads to unreliable or even theoretically impossible results. This is well known in the statistical literature [6, 8–10] and was also noted in the medical literature several years ago. However, up to now, confidence intervals for NNT, if calculated at all, are obtained by applying the simple Wald method [4, 7, 12]. There are a number of better methods that can be used instead of the simple Wald method [6, 8–10]. However, some of these methods require complex computations. Buchan [11] proposed to use exact confidence intervals, which are now provided by StatXact [13]. However, exact methods for interval estimation of proportions are conservative, i.e., they yield confidence intervals that are unnecessarily wide.
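For reference, the simple Wald interval for ARR discussed above can be sketched in a few lines of Python, using the usual standard error of a difference of two independent proportions. The function name `wald_arr_ci` and the counts are hypothetical; the normal quantile comes from the standard library class `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def wald_arr_ci(e1, n1, e2, n2, alpha=0.05):
    """Simple Wald confidence interval for ARR = p1 - p2."""
    p1, p2 = e1 / n1, e2 / n2
    # Usual standard error of the difference of two independent proportions.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # 1 - alpha/2 normal quantile
    arr = p1 - p2
    return arr - z * se, arr + z * se


# Hypothetical counts: 30/100 events under control, 20/100 under treatment.
ll, ul = wald_arr_ci(30, 100, 20, 100)
print(f"95% Wald CI for ARR: {ll:.4f} to {ul:.4f}")
```

In this hypothetical case the interval encloses 0, so the corresponding NNT interval consists of two half intervals.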
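It has been shown that confidence intervals based upon Wilson scores have coverage probabilities close to the nominal confidence level [6, 14–16]. Moreover, they are easier to calculate than exact confidence intervals. Hence, after investigating 11 methods for interval estimation of ARR, Newcombe proposed to use the Wilson score method [6]. The 100 × (1 − α)% confidence interval for ARR based upon Wilson scores is given by:

$$\left[\, p_1 - p_2 - \sqrt{(p_1 - l_1)^2 + (u_2 - p_2)^2},\;\; p_1 - p_2 + \sqrt{(u_1 - p_1)^2 + (p_2 - l_2)^2} \,\right]$$

where $l_i$ and $u_i$ are the lower and upper Wilson score limits for the risk in group $i$ ($i = 1, 2$), i.e., the two roots in $\pi_i$ of $|\pi_i - p_i| = z_{1-\alpha/2}\sqrt{\pi_i(1 - \pi_i)/n_i}$.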
The corresponding approximate confidence limits for NNT can then be calculated by LL(NNT) = 1/UL(ARR) and UL(NNT) = 1/LL(ARR) in consideration of the NNT scale ranging from 1 through ∞ to −1 (see above). For calculations, a SAS/IML [17] program can be used that is available via the internet <http://www.uni-bielefeld.de/~rbender/SOFTWARE/nnt_ci.sas> or from the author on request.
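As an alternative illustration to the SAS/IML program, the Wilson score interval for ARR can be sketched in Python. This sketch assumes the closed-form Wilson score limits for a single proportion and combines them by the square-and-add rule for a difference of two proportions; it has not been checked against the author's program. The function names `wilson_limits` and `wilson_arr_ci` and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_limits(e, n, z):
    """Wilson score limits (l, u) for a single proportion e/n."""
    p = e / n
    centre = 2 * n * p + z * z
    half_width = z * math.sqrt(z * z + 4 * n * p * (1 - p))
    denom = 2 * (n + z * z)
    return (centre - half_width) / denom, (centre + half_width) / denom


def wilson_arr_ci(e1, n1, e2, n2, alpha=0.05):
    """Confidence interval for ARR = p1 - p2 based on Wilson scores."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p1, p2 = e1 / n1, e2 / n2
    l1, u1 = wilson_limits(e1, n1, z)
    l2, u2 = wilson_limits(e2, n2, z)
    arr = p1 - p2
    ll = arr - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    ul = arr + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return ll, ul


# Hypothetical counts: 56/70 events under control, 48/80 under treatment.
ll, ul = wilson_arr_ci(56, 70, 48, 80)
print(f"95% Wilson score CI for ARR: {ll:.4f} to {ul:.4f}")
if ll > 0:
    # Interval excludes 0: invert and exchange the limits.
    print(f"NNTB {1 / ul:.1f} to {1 / ll:.1f}")
```

Unlike the Wald interval, this interval does not collapse to a single point when both observed risks are 0, because the single-proportion Wilson limits remain non-degenerate at the boundary.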
SHORTCOMINGS OF THE SIMPLE WALD METHOD
In principle, the shortcomings of the Wald confidence intervals carry over from ARR to NNT. However, for interpretation the NNT scale has to be taken into account. In the following the confidence intervals for NNT based on Wilson scores are compared with the Wald confidence intervals by means of published and artificial examples. The published examples are estimated NNT values found in the journal Evidence-Based Medicine [18–21]. Here, we concentrate on the comparison of the confidence intervals and do not discuss the clinical background of the studies. The adequacy of the Wald confidence intervals is mainly dependent on the sample size and the distance of the risks from the extreme points 0 and 1. Nevertheless, in the following the properties of the Wald confidence intervals for NNT are described with reference to the sample size and the size of the NNT value, because this information is mostly given in articles whereas the risks themselves are frequently missing.
In Table 1, example 1 shows that the Wald method could be used if NNT is low (say, NNTB < 10) and the sample size is moderate (n > 100). Note that NNT values below 10 correspond to ARR values above 0.1, which are only possible if at least one of the estimated risks is larger than 10%. It can be expected that the Wald confidence intervals are inadequate if both risk estimates are close to 0, which results in higher NNT values. For very high NNT values (say, NNTH > 100), the sample size has to be extremely large (n > 10,000) to get reliable confidence limits by means of the Wald method (example 2). For high NNT estimates (say, NNTB > 10), the Wald method is insufficient in the case of moderate sample size (n > 100, example 3) but improves markedly for large sample sizes (n > 1000, example 4), although the Wald confidence interval is still too short in this case. These examples are based on published results demonstrating that the application of the Wald method may lead to inappropriate confidence intervals in situations occurring in practice.
The deficiencies of the simple Wald method are pointed out more clearly by means of artificial examples. For high NNT estimates especially the upper Wald confidence limit is unreliable, even for moderate sample sizes (artificial example 1). In most cases, the upper Wald confidence limit will be too low. However, in the case of quite different sample sizes between the two groups, the opposite may be true. In artificial example 2 the Wald UL of 486 is much larger than the UL of 64 calculated by means of the Wilson score method. At first sight, the Wald confidence interval seems to be wider than the Wilson confidence interval. However, this is true only in the NNT scale. In the ARR scale the Wald confidence interval is shorter than the Wilson confidence interval and therefore inadequate. The magnitude of the difference between the Wald and Wilson lower confidence limits (|11.4 − 9.4| = 2 in NNT scale, but |0.088 − 0.107| = 0.019 in ARR scale) is larger in the ARR scale than between the upper limits (|486 − 64| = 422 in NNT scale, but |0.002 − 0.016| = 0.014 in ARR scale). This makes it difficult to interpret small and large NNT values. On one hand, there is no substantial difference between large NNT values, say between NNT = 1000 and NNT = 5000. In terms of probabilities this is only a difference of 0.001 − 0.0002 = 0.0008. On the other hand, for public health decisions it may matter whether 1000 or 5000 patients have to be treated to prevent one death. However, whether relying on the NNT or the ARR scale, both Wald confidence limits are unreliable in the case of small risks and a highly unbalanced design.
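The scale comparison in this paragraph can be retraced with a short Python sketch that converts the quoted NNT limits back to the ARR scale. The helper name `arr_scale_gap` is hypothetical; the numbers are the confidence limits quoted in the text for artificial example 2.

```python
def arr_scale_gap(nnt_a, nnt_b):
    """Absolute difference between two NNT values on the ARR scale."""
    return abs(1.0 / nnt_a - 1.0 / nnt_b)


# Lower limits: close together on the NNT scale, further apart on the ARR scale.
lower_nnt_gap = abs(11.4 - 9.4)
lower_arr_gap = arr_scale_gap(11.4, 9.4)

# Upper limits: far apart on the NNT scale, closer on the ARR scale.
upper_nnt_gap = abs(486 - 64)
upper_arr_gap = arr_scale_gap(486, 64)

print(f"lower limits: NNT gap {lower_nnt_gap:.0f}, ARR gap {lower_arr_gap:.3f}")
print(f"upper limits: NNT gap {upper_nnt_gap:.0f}, ARR gap {upper_arr_gap:.3f}")
print(f"NNT 1000 vs 5000 on the ARR scale: {arr_scale_gap(1000, 5000):.4f}")
```

Because the reciprocal compresses large NNT values, a gap of several hundred on the NNT scale can be smaller on the ARR scale than a gap of 2.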
The Wald method leads to several aberrations. NNT estimates close to 1 and low sample size can lead to a theoretically impossible lower Wald confidence limit (artificial example 3). If the ARR estimate is exactly 1, no meaningful Wald confidence interval can be calculated because the standard error of ARR is erroneously 0 (artificial example 4). The same holds when both risk estimates are exactly 0 (artificial example 5).
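These three aberrations can be reproduced with a small Python sketch. The counts below are hypothetical and are not the artificial examples 3 to 5 of Table 1; the function name `wald_arr_summary` is likewise an assumption made for illustration.

```python
import math
from statistics import NormalDist


def wald_arr_summary(e1, n1, e2, n2, alpha=0.05):
    """Return (arr, se, ll, ul) for the simple Wald interval of ARR."""
    p1, p2 = e1 / n1, e2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    arr = p1 - p2
    return arr, se, arr - z * se, arr + z * se


# Hypothetical counts (events, patients) for control and treatment group.
cases = {
    "ARR estimate close to 1, small sample": (9, 10, 0, 10),
    "ARR estimate exactly 1": (10, 10, 0, 10),
    "both risk estimates exactly 0": (0, 10, 0, 10),
}

for label, counts in cases.items():
    arr, se, ll, ul = wald_arr_summary(*counts)
    print(f"{label}: ARR={arr:.2f}, SE={se:.4f}, CI={ll:.4f} to {ul:.4f}")
```

In the first case the upper ARR limit exceeds 1, so the lower NNT limit 1/UL(ARR) falls below 1; in the other two cases the standard error is 0 and the interval collapses to a single point.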
USING NNT FOR EQUIVALENCE TRIALS
The possible aberrations of the simple Wald method to calculate confidence intervals for ARR and NNT are meaningful especially for equivalence trials. To demonstrate equivalence in therapeutic clinical trials the use of confidence intervals with coverage probability of 95% or more is recommended. Frequently, the objective of a study is to show that the new treatment is not inferior to the standard treatment. In such trials, one possibility to demonstrate equivalence between treatments at one-sided significance level α is to show that the value of the 100 × (1 − 2α)% confidence limit corresponding to the deterioration of the effect is better than a predefined acceptable difference.
As clinicians argue more and more in terms of NNT, it seems logical to use NNT also as an effect measure in equivalence trials. If NNT is better understood than, for example, the odds ratio, it should be easier to define an appropriate acceptable difference for NNT than for the odds ratio. For example, if the risk of the standard treatment group is expected to be 5%, a possible acceptable difference for NNT could be the value NNTH = 100. Thus, it is defined that the new treatment is not inferior to the standard treatment if 100 or more patients are needed to be treated for one additional patient to be harmed. To demonstrate one-sided equivalence between the new and the standard treatment the upper confidence limit for NNT must be larger than NNTH = 100 or must lie within the range of NNTB 1 to ∞. The latter would mean that the new treatment is even superior to the standard treatment.
In artificial example 6 the upper 95% Wald confidence limit of NNTH = 144 would lead to the decision of equivalence. This decision, however, is questionable because the Wald confidence interval is probably too short, as is shown by the Wilson score confidence interval of NNTB 10 to ∞ to NNTH 78. This means that there may be up to 1 of 78 treated patients who is harmed instead of 1 of 100 treated patients. Thus, the upper confidence limit is beyond the acceptable difference of NNTH = 100. If NNT is used as an effect measure in equivalence trials, the usual Wald confidence intervals for ARR and NNT should not be applied even in the case of moderate sample sizes. The decision that two treatments are equivalent with regard to NNT should be based upon appropriate confidence limits to ensure adequate decisions.
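The decision rule used in this example can be sketched on the ARR scale, where it becomes a single comparison: an upper NNT limit of NNTH x corresponds to LL(ARR) = −1/x, and an upper limit in the NNTB range corresponds to LL(ARR) > 0. The function name `noninferior_on_nnt_scale` is hypothetical, and the strict inequality is an assumption following the wording "must be larger than"; the limits 144 and 78 and the margin 100 are taken from the text.

```python
def noninferior_on_nnt_scale(ll_arr, nnth_margin):
    """Check one-sided equivalence against an acceptable NNTH margin.

    ll_arr: lower confidence limit for ARR (negative values mean harm)
    nnth_margin: acceptable difference expressed as NNTH, e.g. 100
    """
    if nnth_margin <= 0:
        raise ValueError("nnth_margin must be positive")
    # NNTH limit larger than the margin, or limit in the NNTB range,
    # is equivalent to LL(ARR) > -1/margin on the ARR scale.
    return ll_arr > -1.0 / nnth_margin


margin = 100
wald_ll_arr = -1.0 / 144    # upper Wald limit NNTH = 144
wilson_ll_arr = -1.0 / 78   # upper Wilson score limit NNTH = 78

print("Wald:", noninferior_on_nnt_scale(wald_ll_arr, margin))
print("Wilson score:", noninferior_on_nnt_scale(wilson_ll_arr, margin))
```

Working on the ARR scale avoids special handling of the jump through ∞ that separates the NNTB and NNTH ranges.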
DISCUSSION AND CONCLUSION
NNT has become a popular summary statistic to describe the absolute effect of a given treatment in comparison to a standard treatment or control. It was first introduced for use in randomized placebo-controlled clinical trials, then adopted as the primary outcome measure for systematic reviews such as meta-analyses, extended to the statistic “number needed to screen” to compare strategies for disease screening, and is now applied also in epidemiology to express the magnitude of adverse effects in case-control studies. NNTs are popular among clinicians because at first sight they are easier to understand than odds ratios or even ARRs. However, there are different opinions about what is easy to understand. Some authors still prefer to use ARR rather than NNT [28–31]. In my opinion, ARR and NNT contain equivalent information. In trials with a beneficial effect of the treatment, ARR expresses this effect in terms of numbers of additional adverse events prevented per 100 people treated (if ARR is presented in percentages), while NNT is the number of people needed to be treated to prevent one additional adverse event. Both measures can be applied; however, to use and interpret them adequately, the underlying scale has to be understood. As NNTs are used more and more in biomedical research, it is apparent that appropriate methods to calculate confidence intervals for NNTs are required.
In the current medical literature the calculation and reporting of confidence intervals for NNT is quite unsatisfactory. A systematic search through all issues of the journal Evidence-Based Medicine (1995–1999) revealed that confidence intervals for estimated NNTs are given only for significant results. If confidence intervals are reported, the method used for calculation is frequently unclear. One reason for this is that a definition of NNT is given only for the simple situation of a randomized clinical trial comparing a new with a standard treatment concerning a binary outcome over a fixed follow-up time. However, in practice NNT values are also calculated for trials with variable follow-up times. For example, in the UK Prospective Diabetes Study (UKPDS), NNTs have been calculated for the comparison of less tight blood pressure control (control group) and tight blood pressure control (treatment group). For the outcome of diabetes-related death, the result NNTB = 15 (95% confidence interval: 12.1 to 17.9) was obtained. However, 1 year later, for the same data, NNTB = 20 (95% confidence interval: 10 to 100) was calculated. Such different results concerning NNT estimation and confidence intervals are probably due to the application of unclear and questionable ad hoc methods in studies with varying follow-up times. To estimate NNT with appropriate confidence intervals in trials in which the outcome is time to an event rather than a simple binary variable, more complex methods are required.
In trials with fixed follow-up time and a binary outcome, the only method routinely used in practice seems to be the inverting and exchanging of the simple Wald confidence limits for ARR. This procedure, however, leads to unreliable confidence intervals for NNT in many cases, especially in studies with low sample sizes, risks close to 0 or 1, unbalanced designs, and equivalence trials. In situations in which it is particularly important to quantify the uncertainty of estimations, the usual Wald method fails. The application of the Wilson score method leads to confidence intervals for NNT that have much better coverage properties, are free of aberrations, and are much easier to calculate than exact confidence intervals. Any estimated NNT should be complemented by an adequate confidence interval and the calculation method should be stated. For interval estimation of NNTs in trials with fixed follow-up times and binary outcomes I recommend the replacement of the usual Wald method with the Wilson score method.
I thank Robert G. Newcombe for his valuable and helpful comments, which improved the
1. Cook RJ, Sackett DL. The number needed to treat: A clinically useful measure of treatment effect. BMJ
2. Sackett DL. On some clinically useful measures of the effects of treatment. Evidence-Based Med
3. Chatellier G, Zapletal E, Lemaitre D, Ménard J, Degoulet P. The number needed to treat: A clinically useful nomogram in its proper context. BMJ
4. Altman DG. Confidence intervals for the number needed to treat. BMJ
5. Lesaffre E, Pledger G. A note on the number needed to treat. Control Clin Trials
6. Newcombe RG. Interval estimation for the difference between independent proportions: Comparison of eleven methods. Stat Med
7. Daly LE. Confidence limits made easy: Interval estimation using a substitution method. Am J Epidemiol
8. Miettinen OS, Nurminen M. Comparative analysis of two rates. Stat Med
9. Beal SL. Asymptotic confidence intervals for the difference between binomial parameters for the use with small samples. Biometrics
10. Wallenstein S. A non-iterative accurate asymptotic confidence interval for the difference between two proportions. Stat Med
11. Buchan IE. Computer software that can calculate confidence intervals is now available (letter). BMJ
12. Gardner MJ, Altman DG. Confidence intervals rather than P values: Estimating rather than hypothesis testing. BMJ
13. Mehta CR, Patel NR. StatXact 4 for Windows. Statistical Software for Exact Nonparametric Inference. Cambridge, MA: CYTEL Software Corporation; 1999.
14. Agresti A, Coull BA. Approximate is better than “exact” for interval estimation of binomial proportions. Am Statistn
15. Vollset SE. Confidence intervals for a binomial proportion. Stat Med
16. Newcombe RG. Two-sided confidence intervals for the single proportion: Comparison of seven methods. Stat Med
17. SAS. SAS/IML User’s Guide, Version 5 Edition. Cary, NC: SAS Institute Inc.; 1985.
18. Stobart K. Periodic blood transfusions reduced the risk for stroke in children with sickle-cell anaemia. Evidence-Based Med
19. Walma E, Thomas S. Captopril was not more effective than conventional treatment in hypertension and led to an increase in stroke. Evidence-Based Med
20. Faris I. Vein-patch closure was better than primary closure in decreasing early strokes and arterial occlusion in carotid endarterectomy. Evidence-Based Med
21. Woods KL. Pravastatin reduced cardiovascular events in older patients with myocardial infarction and average cholesterol levels. Evidence-Based Med
22. Jones B, Jarvis P, Lewis JA, Ebbutt AF. Trials to assess equivalence: The importance of rigorous methods. BMJ
23. The CPMP Working Party on Efficacy of Medical Products. Biostatistical methodology in clinical trials in applications for marketing authorizations for medical products. Stat Med
24. Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the consequences of treatment. N Engl J Med
25. McQuay HJ, Moore A. Using numerical results from systematic reviews in clinical practice. Ann Intern Med
26. Rembold CM. Number needed to screen: Development of a statistic for disease
27. Bjerre LM, LeLorier L. Expressing the magnitude of adverse effects in case-control studies: “The number of patients needed to be treated for one additional patient to be harmed.” BMJ
28. North D. Number needed to treat: Absolute risk reduction may be easier for patients to understand (letter). BMJ
29. Pickin M, Nicholl J. Number who benefit per unit of treatment may be a more appropriate measure (letter). BMJ
30. Newcombe RG. Confidence intervals for the number needed to treat: Absolute risk reduction is less likely to be misunderstood (letter). BMJ
31. Hutton JL. Number needed to treat: Properties and problems. J R Stat Soc A
32. The UK Prospective Diabetes Study (UKPDS) Group. Tight blood pressure control and risk of macrovascular and microvascular complications in type 2 diabetes: UKPDS 38. BMJ
33. Almbrand B, Malmberg K, Ryden L. Tight blood pressure control reduced diabetes mellitus-related deaths and complications and was cost-effective in type 2 diabetes. Evidence-Based Med
34. Altman DG, Andersen PK. Calculating the number needed to treat where the outcome is time to an event. BMJ