In recent years, network meta-analyses have been increasingly carried out to inform clinical guidelines and policy. This approach is under constant development, and a broad consensus on how to carry out several of its methodological and statistical steps is still lacking. Therefore, different working groups might often make different methodological choices based on their clinical and research experience, with possible advantages and shortcomings. In this contribution, we will critically assess two network meta-analyses on the topic of pharmacological prevention of relapse in schizophrenia, carried out by two different research groups. We will highlight the implications of different methodological choices on the analysis results and their clinical–epidemiological interpretation. Moreover, we will discuss some of the most relevant technical issues of network meta-analyses for which there is not a broad methodological agreement, including the assessment of transitivity.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Pharmacological prevention of relapse in schizophrenia is a debated topic. Recently, two different research groups attempted to address this relevant clinical issue by synthesising data from randomised controlled trials (RCTs) using network meta-analysis (NMA) methodology.1 2 In this contribution, jointly written by some of the researchers involved in both NMAs, similarities and differences of these two approaches are examined in order to discuss how different methodological choices can be applied to the same clinical problem and whether they contributed to differences in results and conclusions. As a second aim, we critically appraised some of the current technical issues of NMAs, exemplified in these two NMAs, suggesting possible future developments.
Similarities and differences between the NMAs
Both NMAs included RCTs enrolling clinically stable adults with schizophrenia spectrum disorders (table 1), relying mostly on the definition of ‘clinical stability’ provided by primary studies, but with slight differences. Ostuzzi et al 2 additionally included those RCTs where, although individuals were not clearly defined as ‘clinically stable’, mean severity scores (eg, Brief Psychiatric Rating Scale) at baseline indicated relatively low levels of psychopathology, according to commonly employed cut-offs. This led to the inclusion of 14 RCTs (1486 individuals) that would have been otherwise excluded. Further, Schneider-Thoma et al 1 excluded individuals with prominent negative symptoms, considering that this specific subgroup of individuals may not be comparable (transitive) with the target population in different regards, while Ostuzzi et al did not apply this exclusion criterion. This accounted for some differences, such as amisulpride not being included in the network of Schneider-Thoma et al, as it was only investigated in trials for prominent negative symptoms. Overall, for the primary outcome (ie, relapse), Ostuzzi et al included 89 RCTs (22 275 participants) and Schneider-Thoma et al included 100 RCTs (16 812 participants). Despite similar inclusion criteria, only 44 RCTs contributed to this analysis for both NMAs.
We note that being more or less inclusive ultimately depends on methodological considerations, as well as the clinical perspective of the working group. On one hand, being inclusive might increase heterogeneity between studies and threaten the assumption of transitivity, which postulates that included RCTs should be similar in the distribution of all potential effect modifiers except for the treatments being compared.3 On the other hand, a more inclusive approach might have the benefits of increasing the statistical power and the connectivity of the network, improving external validity and applicability of results by including a broader range of clinical features commonly seen in real-world practice.
Moreover, employing rating scales’ cut-off scores to include/exclude RCTs in meta-analyses is not a routine approach and could be questioned. In this case, ‘clinical stability’ is a complex construct that might not be exhaustively informed by symptom severity only. By contrast, as RCTs employ heterogeneous definitions of ‘clinical stability’, using a common cut-off could be a more consistent measure of symptom severity across different studies.
Similar considerations apply to the construct of ‘relapse’ and how it should be measured. Both NMAs used the definition of relapse provided by the authors of primary studies, but Ostuzzi et al additionally imputed missing data from mean change scores, again using cut-off thresholds (ie, an increase of at least 25% of the baseline Positive and Negative Syndrome Scale at the end of the study). Although such thresholds have been commonly used as an approximation of relapse in randomised trials,4 there is debate around the definition of clinically relevant change according to rating scale scores, and some cut-offs might be appropriate for some subpopulations of patients (eg, those with prominent negative symptoms) and not for others.4–6
Notably, possible biases related to broad inclusion criteria and data imputation can be tested by means of sensitivity analyses, that is, excluding RCTs with specific assumptions (for example, those for which ‘clinical stability’ or ‘relapse’ was imputed). In this case, sensitivity analyses performed by both working groups were largely consistent with primary results, supporting the pragmatism of such methodological choices.
Of relevance, both NMAs compared individual antipsychotics against each other; however, the same antipsychotics were used in the context of different study designs, reflecting different treatment strategies. In particular, participants stabilised with one antipsychotic might be randomised to continuing, decreasing the dose, switching or stopping it (ie, switch to placebo). This might bias the interpretation of results, as different treatment strategies might be included under the same antipsychotic (and therefore the same ‘node’).7
Current technical issues exemplified in the two NMAs
Despite growing literature and guidelines on how to technically carry out an NMA, many choices are ultimately taken on the basis of very pragmatic considerations, including different perceptions of the nature of the clinical problem, its application in real-world practice, as well as clinical and research experience. This should be regarded as valuable because, as long as the methodology is preplanned and transparently and rigorously applied, it allows different viewpoints on the same clinical phenomenon, adding nuance to the discussion around the practical application of clinical–epidemiological data.
Although NMAs are increasingly carried out, and international institutions (such as the Cochrane Collaboration)8 and experts are constantly updating methodological guidelines, some technical and practical issues have not been standardised yet. We chose to discuss some of those that are practically exemplified in these two NMAs.
First, although transitivity is an essential assumption of NMAs, standardised approaches to systematically assess (and ideally quantify) clues of its violation are not available. Many published NMAs do not even perform a formal assessment of this assumption. According to the Cochrane Handbook: ‘transitivity can be evaluated by comparing the distribution of effect modifiers across the different comparisons’,8 which is usually done by visually inspecting box plots of effect modifiers by treatment edges. Moreover, given that the current method of choice for testing for global inconsistency is the design-by-treatment interaction model,9 built on the idea that inconsistency may arise not only at the ‘loop’ level, but also at the ‘design’ level (ie, the set of treatments compared within a study), box plots by study design are another valid approach.10 Comparing treatment edges might be more convenient if causes of possible inconsistency are sought, as the core of NMAs is the identification of treatment effects, which are calculated through pairwise comparisons of outcomes. On the other hand, using study designs allows for testing the homogeneity of the distribution of possible effect modifiers, which might be biased when using treatment edges in the presence of multiarm studies. More generally, we note that an accurate interpretation of box plots is rather difficult and subjective, particularly when the network is sparse (ie, with many comparisons and only a few trials per comparison).
Aiming to overcome these difficulties, Ostuzzi et al attempted to adopt an inferential approach to detect possible distribution imbalances (ie, a Kruskal-Wallis test), on the grounds that, in a random-effects model, the assumption of equal distribution of effect modifiers across comparisons does not hold exactly (that would not be realistic, due to the absence of randomisation between trials), but only in expectation.11 Even such an approach, however, does not give a definitive answer on the possible causes of inconsistency, since tests across nodes or edges would not respect the independence assumption, while tests across designs are likely to suffer from a lack of power unless there is a high number of trials for each design (which is very unlikely). Moreover, it should be highlighted that absence of inconsistency should not be interpreted as evidence of transitivity, an assumption that should be assessed on a theoretical level before the NMA is conducted.12
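To make the inferential approach described above more concrete, the following is a minimal sketch of a Kruskal-Wallis test applied to one potential effect modifier (here, mean age per trial) grouped by study design. It is not the code used in either NMA: the function names `chi2_sf` and `kruskal_wallis`, the design labels and all numbers are hypothetical, and the p value relies on the usual chi-square approximation, which is rough when each design contains only a few trials.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc term plus a finite series in x
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total)

def kruskal_wallis(groups):
    """groups: dict mapping a label to a list of values. Returns (H, df, p)."""
    pooled = sorted(v for values in groups.values() for v in values)
    n = len(pooled)
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2   # midrank for tied values
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; the test is undefined")
    h = sum(sum(rank_of[v] for v in values) ** 2 / len(values)
            for values in groups.values())
    h = (12 / (n * (n + 1)) * h - 3 * (n + 1)) / correction
    df = len(groups) - 1
    return h, df, chi2_sf(h, df)

# Hypothetical mean ages (years) of trials, grouped by study design.
mean_age_by_design = {
    "placebo vs drug A": [38.2, 41.0, 36.5, 39.9],
    "placebo vs drug B": [40.1, 42.3, 37.8],
    "drug A vs drug C": [35.0, 44.6, 39.1, 41.7],
}
h, df, p = kruskal_wallis(mean_age_by_design)
print(f"H = {h:.3f}, df = {df}, p = {p:.3f}")
```

Grouping by design means each trial contributes exactly one observation, which is what the independence assumption of the test requires; grouping by treatment edge would count a multiarm trial several times. With only three or four trials per design, as in this toy input, a non-significant result says little, which is the lack of power noted above.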
Second, Ostuzzi et al excluded small studies (those including fewer than 50 participants), considering that they tend to show higher heterogeneity and that they may bias estimates owing to the omission of publications with non-significant findings in the case of publication bias.13 However, as noted by Zhang et al, omission of studies from NMAs, regardless of the reason, may have a substantial impact on estimated results14 and is therefore not recommended in general.15 Considering that sample size did not act as an effect modifier, inclusion of small studies could have led to higher precision of estimates in Ostuzzi et al. However, it is also possible that it could have led to higher heterogeneity and subsequently reduced precision in the random-effects NMA model, as is the case in Schneider-Thoma et al.
Third, the two working groups used different approaches to the analysis, namely a Bayesian approach (Schneider-Thoma et al) and a frequentist approach (Ostuzzi et al). As noted by Seide et al, the former is more prone to overconservativeness and bias when heterogeneity is small to negligible, while the latter is more prone to anti-conservativeness and bias when heterogeneity is high.16 Overall, an underestimation of the variance, along with the exclusion of small trials (typically associated with a higher heterogeneity13), might have contributed to the lower estimated heterogeneity found by Ostuzzi et al.
Fourth, the two statistical analyses differ in the effect size measure used. Ostuzzi et al analysed risk ratios of relapse, whereas Schneider-Thoma et al analysed odds ratios (ORs) and then transformed the NMA results to risk ratios for presentation to increase interpretability. This illustrates that which effect size measure should be used in meta-analyses of binary outcomes remains unclear and is actively debated, since both approaches are open to criticism.17 18
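The relationship between the two effect measures can be shown with a small worked example. The sketch below computes a risk ratio and an OR from a single hypothetical trial and converts the OR back to a risk ratio using one commonly used formula, RR = OR / (1 − p0 + p0 × OR), where p0 is an assumed control-group risk. This is an illustration only: the function names and numbers are invented, and the conversion shown is an assumption on our part rather than a description of the procedure actually used in Schneider-Thoma et al.

```python
def risk_ratio(events_t, n_t, events_c, n_c):
    """Risk ratio of treatment vs control from event counts and arm sizes."""
    return (events_t / n_t) / (events_c / n_c)

def odds_ratio(events_t, n_t, events_c, n_c):
    """Odds ratio of treatment vs control from event counts and arm sizes."""
    odds_t = events_t / (n_t - events_t)
    odds_c = events_c / (n_c - events_c)
    return odds_t / odds_c

def or_to_rr(or_value, control_risk):
    """Convert an OR to a risk ratio for an assumed control-group risk p0."""
    return or_value / (1 - control_risk + control_risk * or_value)

# Hypothetical trial: 20/100 relapses on drug, 50/100 on placebo.
rr = risk_ratio(20, 100, 50, 100)
or_value = odds_ratio(20, 100, 50, 100)
print(f"RR = {rr:.2f}, OR = {or_value:.2f}")

# The same OR maps to different risk ratios depending on the assumed
# control-group risk, so the presented RR depends on that assumption.
for p0 in (0.1, 0.3, 0.5):
    print(f"p0 = {p0:.1f} -> RR = {or_to_rr(or_value, p0):.3f}")
```

When relapse is common in the control arm, as it typically is under placebo, the OR lies considerably further from 1 than the risk ratio, which is one reason why results expressed as ORs are often re-expressed as risk ratios for presentation.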
Fifth, the qualitative assessment of included RCTs and of pooled estimates might suffer from a certain degree of discretion, possibly leading to different data interpretation. The Cochrane Risk of Bias V.2 (RoB2) includes five domains of bias, each of which includes a series of ‘signalling questions’ to help the researcher judge if the RCT has a ‘low’ or ‘high’ risk of bias, or if there are ‘some concerns’.8 After comparing the RoB2 overall judgements of the 44 RCTs included by both working groups for the outcome relapse, we found a raw agreement of 73.9% but a low inter-rater reliability (κ=0.0899, SE=0.1197), consistent with previous observations.19 Further, in order to assess the overall confidence in pooled estimates of the NMA, the CINeMA (Confidence in Network Meta-Analysis) approach was employed.20 This methodology is broadly based on the GRADE (Grading of Recommendations Assessment, Development and Evaluation) framework, and aims to assess six domains, namely within-study bias (based on the RoB2 judgements), reporting bias, indirectness, imprecision, heterogeneity and incoherence. This approach, although largely automated through a web-based application, still requires some methodological decisions throughout the process, such as defining the cut-off for a clinically relevant effect, defining how to summarise ‘within-study bias’ and ‘indirectness’ across contributions for each network estimate, and deciding how the overall judgement is reached (because different domains are interconnected and therefore should be considered jointly). Different approaches might often change the overall CINeMA report, affecting certainty in the reliability and applicability of results. Therefore, clinical and policy conclusions strongly based on the certainty of evidence might be criticised as being excessively discretionary. In our example, despite the two NMAs providing largely similar results, their conclusions differed somewhat.
Ostuzzi et al indicated three best-performing antipsychotics supported by moderate-to-high certainty of evidence against placebo for both long-acting and oral formulations (table 1), while Schneider-Thoma et al refrained from recommending specific antipsychotics because, according to their evaluations, none reached high certainty of evidence, and few statistically clear differences emerged between individual medications.
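The contrast noted above between a fairly high raw agreement and a low kappa for the RoB2 judgements can look counterintuitive. The following minimal sketch, which uses invented ratings and assumes an unweighted Cohen's kappa for two raters (the exact variant behind the reported value is not stated here), shows how this arises when most trials fall into the same category, so that a large share of the agreement is expected by chance alone.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Return (observed agreement, unweighted Cohen's kappa) for two raters.

    Undefined (division by zero) if chance agreement equals 1, ie, if both
    raters assign every item to the same single category.
    """
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(count_a) | set(count_b)
    # Chance agreement from the product of each rater's marginal proportions.
    expected = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    return observed, (observed - expected) / (1 - expected)

# Hypothetical overall RoB2 judgements for 20 trials by two groups.
pairs = (
    [("some concerns", "some concerns")] * 15
    + [("high", "high")] * 1
    + [("some concerns", "high")] * 2
    + [("high", "some concerns")] * 2
)
group_a = [a for a, _ in pairs]
group_b = [b for _, b in pairs]

observed, kappa = cohen_kappa(group_a, group_b)
print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

In this toy input the two groups agree on 16 of 20 trials, yet almost all of that agreement is what two raters who each assign ‘some concerns’ to 17 of 20 trials would reach by chance, so kappa stays low.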
NMAs are increasingly carried out in many fields of healthcare, and their results can largely inform the development of clinical practice guidelines and health policy actions. However, such analyses require a number of methodological choices, which can notably vary across different working groups. Although some of these technical issues can be developed further, and hopefully agreed upon by methodologists and systematically applied to ensure the most accurate results, some others are largely dependent on clinical and research experience, and cannot be easily standardised. This should be regarded as an opportunity rather than a limitation, as long as implications are critically and transparently discussed. It remains imperative that all methodological decisions be specified in advance in the study protocol, in order to avoid post-hoc choices based on review findings, and ultimately to enhance transparency and replicability of results.
GO and JS-T are joint first authors.
SL and CB are joint senior authors.
Contributors GO, CB, JS-T and SL developed the concept of this paper. GO and JS-T wrote the first draft, which was critically discussed and amended by all authors. FT contributed to writing and revising the statistical and methodological contents. All authors read and approved the final version of the paper.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.