Making the best use of available evidence: the case of new generation antidepressants
  1. Corrado Barbui1,
  2. Andrea Cipriani1,
  3. Toshiaki A Furukawa2,
  4. Georgia Salanti3,
  5. Julian P T Higgins4,
  6. Rachel Churchill5,
  7. Norio Watanabe2,
  8. Atsuo Nakagawa6,
  9. Ichiro M Omori7,
  10. John R Geddes8
  1. 1 Department of Medicine and Public Health, Section of Psychiatry and Clinical Psychology, University of Verona, Italy
  2. 2 Department of Psychiatry and Cognitive–Behavioural Medicine, Nagoya City University Graduate School of Medical Sciences, Nagoya, Japan
  3. 3 Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Greece
  4. 4 MRC Biostatistics Unit Institute of Public Health, University of Cambridge, Cambridge, UK
  5. 5 Department of Community Based Medicine and Cochrane Depression, Anxiety and Neurosis Review Group, University of Bristol, Bristol, UK
  6. 6 Department of Neuropsychiatry, School of Medicine, Keio University, Tokyo, Japan
  7. 7 Cochrane Schizophrenia Group, University of Nottingham, Nottingham, UK
  8. 8 Department of Psychiatry, University of Oxford, Oxford, UK
  1. Correspondence to Corrado Barbui, Department of Medicine and Public Health, Section of Psychiatry and Clinical Psychology, University of Verona, Policlinico “GB Rossi”, Piazzale LA Scuro, 10-37134 Verona, Italy; corrado.barbui{at}


In this issue of Evidence-Based Mental Health, Gartlehner and Gaynes1 comment (see page 98) on our recently published systematic review2 that investigated the comparative efficacy and acceptability of 12 new generation antidepressants (see page 107). In their view, methodological shortcomings limit the validity of our results and the conclusions reached. In this commentary, our aim is to explain the rationale for doing this systematic review, outline its main findings and address the points raised by Gartlehner and Gaynes.1 Scientific debate can illuminate and clarify complex analyses and we are, therefore, delighted to respond to their critique. While we consider that some of the issues raised are substantive and merit reasoned response, we also believe that some of their criticisms seem rather overstated. We understand that Gartlehner and Gaynes too have published an analysis comparing antidepressants and we note that this is now the third occasion on which they have published similar criticisms of our work.3,4

The necessity for doing a multiple treatments meta-analysis on new generation antidepressants

In most countries, demonstration of a difference against placebo, and not against an active comparator, makes a new drug eligible for registration. The European Medicines Agency, for example, is willing to evaluate new antidepressants in the absence of comparison with active existing treatments.5 In situations where no (or few) active treatments are available, this may not be important. In the field of antidepressants, however, where many potentially effective agents are already available, this process has serious implications. Approving a new antidepressant as effective and safe solely on the basis of comparison with placebo allows the marketing of new drugs that may, in fact, be more effective, similarly effective or even less effective than others currently in use. Many antidepressants have never been directly compared with each other. The picture is further complicated by the fact that the available head-to-head antidepressant trials have a number of limitations, including low statistical power, low external validity, outcome reporting bias, sponsorship bias and publication bias.6–12 The lack of robust and reliable active comparisons means that there is a great deal of uncertainty about the place of a new agent in clinical practice. Consequently, clinicians and patients may be daunted and disoriented by a large number of “apparently equally useful” antidepressants and by the lack of evidence based criteria to guide treatment choices.6–11,13 In the face of such uncertainty, treatment choices are inevitably guided by a variety of other criteria, including personal experience, drug availability, acquisition costs, marketing and opinion leaders’ judgments, which are often based on a biased representation of the scientific literature.14

In this scenario, we have a choice: we can either make the best use of the available randomised evidence or essentially ignore it. We believe that it is better to have a set of criteria based on the available evidence than to have no criteria at all. We therefore decided: (1) that a comprehensive review was needed; (2) that statistical pooling of results would be appropriate if the likely biases in the primary studies were dealt with; and (3) that, because not all antidepressants had been compared directly, a statistical technique called multiple treatments meta-analysis (MTM) should be used to calculate estimates of the comparative efficacy and acceptability for all possible comparisons.15,16 This approach has already been used successfully in other fields of medicine where relevant head-to-head comparisons are not available.17,18
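The core of the MTM approach is the consistency equation: two drugs that have each been compared with a common comparator can be compared indirectly by differencing the log odds ratios, with the variances of the two independent direct estimates adding. A minimal sketch, using hypothetical numbers rather than values from the review:

```python
import math

def indirect_log_or(log_or_ac, se_ac, log_or_bc, se_bc):
    """Estimate A vs B indirectly from A vs C and B vs C.

    Consistency equation: log OR_AB = log OR_AC - log OR_BC.
    Variances add because the direct estimates come from
    independent sets of trials.
    """
    log_or_ab = log_or_ac - log_or_bc
    se_ab = math.sqrt(se_ac ** 2 + se_bc ** 2)
    return log_or_ab, se_ab

# Hypothetical direct estimates: drug A vs C (log OR 0.30, SE 0.10)
# and drug B vs C (log OR 0.10, SE 0.12).
log_or, se = indirect_log_or(0.30, 0.10, 0.10, 0.12)
print(round(log_or, 3), round(se, 3))
```

In a full MTM, direct and indirect evidence for each comparison are combined in a single model, but the subtraction above is the elementary building block.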

Main findings of our review

A systematic review of 117 studies, including 25 928 individuals randomly assigned to 12 different new generation antidepressants, was conducted.2 We found that in terms of response, mirtazapine, escitalopram, venlafaxine and sertraline were more efficacious than duloxetine, fluoxetine, fluvoxamine, paroxetine and reboxetine. In terms of acceptability, escitalopram, sertraline, citalopram and bupropion were better tolerated than mirtazapine and venlafaxine.

These results, based on all head-to-head comparisons using all direct and indirect evidence, suggest that: (a) sertraline might be the best choice when starting treatment for moderate to severe major depression in adults because it has the most favourable balance between benefits, acceptability and acquisition cost; (b) reboxetine, fluvoxamine, paroxetine and duloxetine were the least efficacious and least acceptable drugs, making them less favourable options when prescribing acute treatment for major depression; and (c) reboxetine was the least tolerated of the 12 antidepressants and was significantly less effective than the other 11 drugs, suggesting that it should not be used as a routine first-line acute treatment for major depression.

Sensitivity analyses were performed to check the robustness of findings. The overall ranking was not affected by the inclusion of only those studies with dosages within the therapeutic range, or the inclusion of only those studies without data imputation. A meta-regression analysis was also conducted to investigate the effect of sponsorship on outcome estimate, but estimates and ranking did not substantially change.

The protocol for the review was established in advance and published online. We decided a priori to combine studies (patients, interventions, outcomes, methods) as far as we were able to assume that they would provide qualitatively similar results, and we adhered to the protocol throughout the review process (study selection, data extraction and data pooling). We examined the validity of our assumptions by checking heterogeneity within the comparisons and coherence in the network. We did not observe greater heterogeneity or incoherence than could be explained by the play of chance alone.

Response to Gartlehner and Gaynes

Gartlehner and Gaynes1 are particularly concerned about the possibility that the pooled trials included heterogeneous patient populations. They also question the validity of pooling outcome data derived from different rating scales and the validity of dropout by any cause as a measure of treatment acceptability. Furthermore, they question the use of odds ratios (ORs) rather than relative risks (RRs) as an appropriate metric of relative treatment effect and our conclusions about the clinical relevance of the results.

Assumptions of multiple treatments meta-analysis and similarity of populations

As Gartlehner and Gaynes note,1 one key assumption behind any indirect comparison is that the patient samples in the individual trials are reasonably similar with respect to prognosis, severity of disease and other important confounders. In practice, this refers to two highly correlated notions: homogeneity (similarity across studies) and coherence (similarity across comparisons). There should always be healthy scepticism about the similarity of the studies combined in a meta-analysis, and clinical as well as statistical tools should always be used to evaluate these assumptions.19 Although Gartlehner and Gaynes highlight this issue,1 drawing on arguments similar to those used in the early days of meta-analysis, their criticisms are generic and they do not specify any characteristic that might limit the comparability of the studies included in our meta-analysis. We should add that statistical heterogeneity and incoherence were checked and found to be acceptable, and that patient characteristics were judged sufficiently similar.
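For readers unfamiliar with these checks, statistical heterogeneity within a single pairwise comparison is conventionally assessed with Cochran's Q and the I² statistic under a fixed-effect model. The sketch below shows the arithmetic; the study estimates are hypothetical, not data from the review:

```python
def q_and_i2(log_ors, ses):
    """Fixed-effect pooled log OR, Cochran's Q and I^2 (%) for one
    pairwise comparison, from per-study log ORs and standard errors."""
    ws = [1.0 / s ** 2 for s in ses]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(ws, log_ors)) / sum(ws)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(ws, log_ors))
    df = len(log_ors) - 1
    # I^2: share of variability beyond chance; floored at zero
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i2

# Three hypothetical studies of the same drug pair
pooled, q, i2 = q_and_i2([0.25, 0.10, 0.40], [0.15, 0.20, 0.25])
print(round(pooled, 3), round(q, 3), round(i2, 1))
```

Here Q falls below its degrees of freedom, so I² is zero: no more heterogeneity than chance alone would produce, which is the situation described above. Assessing incoherence across the network requires comparing direct against indirect estimates and is not shown in this sketch.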

To ensure the similarity of the included patient populations, we used the following criteria: (a) age: only adults were included, and we noted that only eight of the 117 studies recruited exclusively individuals older than 65 years. The inclusion of these eight studies should not have hampered the analysis, considering that the efficacy of second generation antidepressants does not differ between elderly or very elderly patients and younger patients20; (b) dose comparability: we used the same dose comparability scheme as Gartlehner et al, and in addition we carried out a sensitivity analysis that confirmed the main findings; (c) trial length: we imposed a narrow limit on the time of primary outcome assessment (8 weeks; range 6–12 weeks), which we believe is a methodological advance over previous meta-analyses in this field. For example, Gartlehner et al combined results from trials with durations as widely dispersed as 6 to 24 or 52 weeks20; (d) diagnosis: we included studies of major depression according to modern operationalised criteria and, contrary to what Gartlehner and Gaynes suggest in their commentary,1 did not include minor depression.

Measurement of response to treatment

In our analysis, the primary outcome was response to treatment, defined as the proportion of patients who had a reduction of at least 50% from baseline on the Hamilton Depression Rating Scale (HAM-D) or Montgomery–Asberg Depression Rating Scale (MADRS) or who scored much improved or very much improved on the Clinical Global Impression (CGI) at 8 weeks. Gartlehner and Gaynes note that, to produce valid results in indirect comparisons of response rates, the essential assumption is that a response on one scale equals a response on the other scales.1 This important observation about outcome measurement, true of both direct and indirect comparisons, merits further discussion in relation to clinical trials of antidepressants. While in other fields of medicine the identification of appropriate outcome measures may be a relatively straightforward task, in depression “efficacy” can be an elusive concept, typically quantified by means of “soft” measures such as rating scales.21 The relevance of rating scales has often been questioned by physicians who almost never use them in routine clinical practice, and by methodologists who note their often poor psychometric properties. In our analysis, as in all meta-analyses of antidepressant trials, we had to rely on rating scales to define patient improvement. We listed outcome measures of preference a priori, preferring HAM-D over MADRS over CGI. The HAM-D was used in 96 out of 117 studies, and we had to resort to MADRS and CGI in 18 and three studies, respectively. Gartlehner and Gaynes1 highlighted that the convergent validities among these scales are not perfect but this criticism similarly applies to the HAM-D itself because the reported inter-rater and test–retest reliability coefficients can be as inadequate as the concurrent validity coefficients between HAM-D and MADRS or CGI.22 To some extent, even combining HAM-D with HAM-D must be undertaken with caution!
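Purely as an illustration of the response criterion (this is not code from the review), the scale-based definition reduces to a simple threshold on the change from baseline:

```python
def is_responder(baseline_score, endpoint_score):
    """Response: at least a 50% reduction from the baseline
    rating-scale score (e.g. HAM-D), per the review's definition."""
    return (baseline_score - endpoint_score) >= 0.5 * baseline_score

print(is_responder(24, 12))  # True: exactly a 50% reduction counts
print(is_responder(24, 13))  # False: less than a 50% reduction
```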

Measurement of treatment acceptability

Another controversial point is whether overall discontinuation rates are an adequate measure of tolerability or safety, because remission or lack of efficacy can also cause dropout and mask important differences in adverse events. A depressive episode is a clinical condition that requires continuation treatment, very different from an acute infection, where stopping treatment shortly after remission may be medically advised. We argue that dropout for any reason is a valid index of the acceptability of treatment because we expect patients to stay on a medication even if (or, to be more exact, precisely because) they remitted on that medication within 8 weeks. If a patient in a clinical trial drops out of treatment despite a good response, this is clearly a case of poor acceptability. In antidepressant drug trials, patients are required to continue with the antidepressant even when they feel better; if they fail to do so, this seems a reasonable indication that the treatment did not work or was problematic in some way.

The criticism of the use of trial dropout as an index of overall acceptability is somewhat surprising since, in response to concerns over the use of scales, there has actually been a renewed interest in using it as a reasonably “hard” and simply measured outcome in the field of mental health along with similarly pragmatic events, including suicide attempts, treatment switching, hospitalisation or job loss. Thus recent large trials, for example the CATIE trial, used dropout as the primary outcome.23 We have previously used this logic in a recent systematic review and meta-analysis that employed treatment discontinuation as a hard measure of treatment effectiveness and acceptability.24 In summary, although we would not pretend that dropout is a perfect outcome measure, we believe that Gartlehner and Gaynes overstate the problems.

Odds ratio versus relative risk

Another point of criticism is the use of the OR rather than the RR. According to Gartlehner and Gaynes,1 ORs might have provided substantially larger values than RRs, and they are concerned about the consequences of ORs being misinterpreted as RRs. We agree that this risk exists, although ORs are preferable in the analysis because they have better mathematical properties than RRs. We should also emphasise the crucial point that the choice of effect measure does not affect the results or their statistical significance; it is simply that ORs may be misread as larger effects than they represent. Consequently, the significant results that we obtained are not an artefact of our choice of effect measure, as Gartlehner and Gaynes rather simplistically suggest. Moreover, ORs were selected as the most appropriate effect measure to present in our ranking of the 12 antidepressants (fig 2 of our paper2) because the OR of an event can be converted into the OR of a non-event simply by taking its reciprocal, whereas no such relationship holds for the RR. ORs can be transformed into RRs if the baseline risk is known or assumed, and we expect that informed readers can make full use of our table through such calculations. Furthermore, online access to the dataset enables readers to check whether the use of different measures materially affects the results.
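Both relationships are easy to verify numerically. The reciprocal property of the OR, and the standard conversion from an OR to an RR given the comparator's event risk, can be sketched as follows (the figures are illustrative, not taken from the review):

```python
def or_nonevent(or_event):
    """The OR of the complementary outcome is the reciprocal."""
    return 1.0 / or_event

def or_to_rr(or_event, baseline_risk):
    """Convert an OR to an RR given the comparator's event risk p0:
    RR = OR / (1 - p0 + p0 * OR)."""
    return or_event / (1.0 - baseline_risk + baseline_risk * or_event)

print(or_nonevent(2.0))              # 0.5
print(round(or_to_rr(2.0, 0.5), 3))  # 1.333
```

The second call illustrates the reviewers' point: with a common event (baseline risk 0.5), an OR of 2.0 corresponds to an RR of only about 1.33, whereas with rare events the two measures nearly coincide.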

Who is delighted by our conclusions?

Gartlehner and Gaynes state that the pharmaceutical companies marketing sertraline and escitalopram must have been delighted by our conclusions.1 We are very disturbed by the inference that our independent and objectively obtained results and conclusions should have been suppressed just in case those with financial interests might derive benefit. Our analysis was entirely motivated by the need to determine which drugs might promise most benefits for patients and we have no interest in which companies might derive commercial benefit from the results. We wonder if Gartlehner and Gaynes might be subtly suggesting another form of sponsorship bias.

Gartlehner and Gaynes might also consider that sertraline is now off patent in North America and Europe, and that escitalopram will be soon. We hope it is first and foremost the clinicians and their patients who have been helped and delighted most by our findings.25

Clinical relevance of differences and implication for practice

In November 2008, Gartlehner and Gaynes published a study in the Annals of Internal Medicine that employed similar but not identical methods to our analysis.20 They found that “for most comparisons, differences in treatment effects were similar between the two studies (ie, Gartlehner and colleagues20 and Cipriani and colleagues2); in both studies some of the comparisons rendered statistically significant differences in response rates”. We note that Gartlehner and Gaynes admit that they reached similar conclusions by employing the methods that they now denounce. We strongly believe that it would have been a real disservice to individuals worldwide who suffer from depression if we had ignored, for example, the difference between sertraline and reboxetine which translates into a number needed to treat (NNT) of 7, or the NNT of 20 for sertraline over paroxetine. We recognise that this could be overridden by some important differences in costs or side effects but, in their apparent absence, the NNT may be a reasonable criterion for guiding treatment choices.
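As a worked illustration of how such NNTs arise (the response rates below are hypothetical, chosen only to reproduce the magnitudes quoted above):

```python
import math

def nnt(risk_treatment, risk_control):
    """Number needed to treat: reciprocal of the absolute risk
    difference, conventionally rounded up to a whole patient."""
    return math.ceil(1.0 / (risk_treatment - risk_control))

print(nnt(0.60, 0.45))  # 7: a 15-percentage-point gain in response rate
print(nnt(0.55, 0.50))  # 20: a 5-percentage-point gain
```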

We freely acknowledge that different interpretations may be derived from the same risk estimates. This is the reason why we presented in fig 3 of The Lancet paper all ORs for all comparisons, both for efficacy and acceptability.2 Clinicians interested in a specific antidepressant may see how it behaves in comparison with the others. Hence readers may develop their own judgment based on the available evidence. Readers who have concerns about how risk estimates were calculated may freely access the full dataset to replicate our analysis.

Instead of a systematic review of clinical trial data, Gartlehner and Gaynes suggest that clinically relevant differences should be detected by “large, pragmatic, randomised controlled trials that directly compare the benefits and harms of second generation antidepressants”.1 Certainly, this is easier said than done. A trial that compares 12 new antidepressants and makes each of the 66 possible comparisons would require a total sample size of 10 632 patients (alpha = 0.05 and beta = 0.20) to detect at least a small effect (effect size = 0.2). If we further wished to detect a difference in side effects of more than 5 percentage points, the total sample size needed would jump to 29 663 patients. Finally, and perhaps most importantly, a clinical trial should rest on the central ethical principle of equipoise: a randomised controlled trial should take place only if there is true uncertainty about which intervention is most likely to be superior. Despite the limitations of our study, it casts serious doubt on the assumption that all antidepressants have similar efficacy.
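The sample-size arithmetic can be roughly reconstructed under plausible assumptions. The text does not state the exact correction used, so the sketch below assumes a two-sided test with a Bonferroni-corrected alpha across the 66 comparisons and 80% power to detect a standardised effect of 0.2; it lands within rounding distance of the figure quoted above:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(effect_size, n_comparisons, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-arm comparison of a standardised
    effect, with alpha Bonferroni-corrected for multiple comparisons."""
    z_a = NormalDist().inv_cdf(1 - (alpha / n_comparisons) / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * (z_a + z_b) ** 2 / effect_size ** 2)

# 12 treatment arms, 66 pairwise comparisons, small effect (d = 0.2)
total = 12 * n_per_arm(0.2, 66)
print(total)  # close to the 10 632 quoted in the text
```

Each arm serves in 11 comparisons, which is what keeps the total near 10 600 rather than the roughly 52 000 patients that 66 independent two-arm trials would require.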

Concluding remarks

We are aware that the results of our review challenge the standard thinking that most antidepressants are of similar average efficacy and tolerability. Even when significant differences are observed between drugs, they tend to be minimised and considered clinically insignificant.20 We have already noted that this is a comfortable position for industry, as it sets a low threshold for the introduction of new agents, which can then be marketed on the basis of small differences in specific adverse effects rather than on clear advantages in overall average efficacy and acceptability.26 Nonetheless, other evidence as well as our review suggests that antidepressants may vary in both their efficacy and adverse effects,27–31 and also in terms of hard outcome measures, for example the risk of suicide. A Food and Drug Administration (FDA) meta-analysis that included 372 placebo controlled antidepressant trials and nearly 100 000 patients, posted more than 2 years ago on the FDA website32 and recently published in the BMJ,33 found that the odds of suicidal behaviour on sertraline, for example, are around half those on placebo.34

We are aware of the limitations of our approach. Firstly, the exclusion of placebo controlled trials inevitably kept placebo out of the network of comparisons, with the negative consequence of reducing the amount of data. It is therefore possible that our estimates are less precise than they would have been had the placebo controlled trials been included. A common misunderstanding about MTM is that the choice of a baseline comparator might affect the results. In fact, the choice of common comparator is usually arbitrary or based on common sense (eg, placebo) and affects only the interpretation of the results, not their validity. As MTM assumes transitivity, all pairwise comparisons can be derived from our table. In our analysis, we decided for convenience to express the comparative efficacy of the 12 antidepressant drugs using fluoxetine as the reference drug: fluoxetine was the first of these 12 antidepressants to be marketed in Europe and the USA, and it has consistently been used as the reference drug in the different pairwise comparisons.

A second limitation is that we studied only a selected number of antidepressants, and therefore we do not know how other agents would have ranked within the network. In particular, the exclusion of some old antidepressants that have been shown in direct meta-analyses to have the edge in terms of efficacy over newer drugs does not allow us to clarify their possible role in the treatment of depression.27 A further reason for concern is that efficacy was treated as a dichotomous outcome only. Although we think it unlikely that treating the outcomes as dichotomous rather than continuous variables would have produced qualitatively different results rather than slight quantitative differences in the estimates, it would have been of interest to include both measures, as debate persists on the pros and cons of these two approaches.35 Finally, our 12 preparatory systematic reviews were as comprehensive as possible, involving contact with the original authors and all relevant pharmaceutical companies to obtain further data. Although none of the funnel plots in the individual head to head meta-analyses were suggestive of publication bias, and the meta-regressions taking into account sponsorship bias did not change the results in the MTM, the possibility of residual publication bias and outcome reporting bias cannot be excluded.

We believe that, despite the likely biases of the included trials, and the limitations of our approach, our analysis makes the best use of the randomised evidence, providing clinicians with evidence based criteria that can be used to guide treatment choices. Additionally, by suggesting that sertraline is better than other new generation drugs, our analysis indicates that this antidepressant could be used as a standard comparator in phase III trials. The requirement for a comparison with existing drugs would add value to the regulation process involved in the introduction of new drugs to the market.36 It is of note that the FDA may incorporate comparative effectiveness information into labelling.37 Such changes to procedure would set a higher threshold for the licensing of new drugs, motivating investigators to develop truly innovative drugs.

