Background Cut-offs on self-report depression screening tools are designed to identify many more people than those who meet criteria for major depressive disorder. In a recent analysis of the European Health Interview Survey (EHIS), the percentage of participants with Patient Health Questionnaire-8 (PHQ-8) scores ≥10 was reported as major depression prevalence.
Objective We used a Bayesian framework to re-analyse EHIS PHQ-8 data, accounting for the imperfect diagnostic accuracy of the PHQ-8.
Methods The EHIS is a cross-sectional, population-based survey in 27 countries across Europe with 258 888 participants from the general population. We incorporated evidence from a comprehensive individual participant data meta-analysis on the accuracy of the PHQ-8 cut-off of ≥10. We evaluated the joint posterior distribution to estimate major depression prevalence and prevalence differences between countries, and compared our estimates with previous EHIS results.
Findings Overall, major depression prevalence was 2.1% (95% credible interval (CrI) 1.0% to 3.8%). Mean posterior prevalence estimates ranged from 0.6% (0.0% to 1.9%) in the Czech Republic to 4.2% (0.2% to 11.3%) in Iceland. Accounting for the imperfect diagnostic accuracy resulted in insufficient power to establish prevalence differences. 76.4% (38.0% to 96.0%) of observed positive tests were estimated to be false positives. Prevalence was lower than the 6.4% (95% CI 6.2% to 6.5%) estimated previously.
Conclusions Prevalence estimation needs to account for imperfect diagnostic accuracy.
Clinical implications Major depression prevalence in European countries is likely lower than previously reported on the basis of the EHIS survey.
- adult psychiatry
- depression & mood disorders
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Based on the Patient Health Questionnaire-8 (PHQ-8), the prevalence of current depressive disorder among participants in the European Health Interview Survey (EHIS; 27 countries, n=258 888) was recently reported to be 6.4%.
A 2021 individual participant data meta-analysis (44 studies, 9242 participants) found that the PHQ-9, which performs equivalently to the PHQ-8, identifies 2.5 times as many major depression cases as a validated semi-structured diagnostic interview.
WHAT THIS STUDY ADDS
In this study, we accounted for the imperfect diagnostic accuracy of the PHQ-8 using a Bayesian framework and estimated a considerably smaller overall prevalence of 2.1% across Europe.
Despite large differences in the proportion of observed positive tests between countries, accounting for diagnostic accuracy of the PHQ-8 resulted in insufficient power to establish prevalence differences.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
This study highlights a method to account for imperfect diagnostic accuracy in prevalence estimation and suggests that prevalence estimates from population-based surveys need to be interpreted carefully.
Depression is a leading cause of burden from disease worldwide,1 responsible for approximately 2.2 million excess deaths in 2010,2 with numbers rising over recent decades.3 Effective public health interventions could reduce the burden of depression.4 Such strategies, however, must be informed by accurate data collected in large-scale population-based studies.
The largest study of depression prevalence across European countries in recent years was based on data from the European Health Interview Survey (EHIS), a large-scale population survey intended to inform health policy in Europe.5 Participants with scores ≥10 on the Patient Health Questionnaire-8 (PHQ-8) screening tool were classified as having major depression. Authors reported an overall prevalence of 6.4% (95% CI 6.2% to 6.5%) and estimates for 27 European countries that ranged from 2.6% (2.1% to 3.0%) in the Czech Republic to 10.3% (9.3% to 11.3%) in Iceland. There were large and significant differences between European countries.5 This paper has been broadly cited.
Depression screening tools, including the PHQ-8 used in the EHIS, are designed to identify many more people than those who will eventually be determined to have a disorder after more comprehensive clinical assessment. These questionnaires are not designed to make definitive clinical diagnoses,6 7 as positive screening results can be explained, for example, by depressive symptoms that accompany other mental disorders. Also, people who actually have major depression, but are responding to treatment, might screen negative.
Ideally, major depression prevalence would hence be estimated on the basis of validated diagnostic interviews which are designed to replicate the diagnostic process, including assessment of symptom severity and impairment and ruling out alternative origins. These methods have been used in large population-based studies,8 and the resources required can be reduced using strategies such as two-step implementation in conjunction with self-report screening tools.9
Nonetheless, brief self-report assessments are attractive and commonly used in studies on the prevalence of mental health conditions, as they are easy to administer to a large number of participants. Ignoring their imperfect diagnostic accuracy, however, leads to exaggerated prevalence estimates. An individual participant data meta-analysis of 44 primary studies (9242 participants) found that scores of ≥10 on the PHQ-9, which performs equivalently to the PHQ-8,6 overestimated major depressive prevalence by 2.5 times compared with a validated semi-structured diagnostic interview.10 This finding is consistent with earlier research7 11 and research on other depression screening tools.12 13
Reporting of the positive test rate of the PHQ-8 as an index of population prevalence leads to an overestimation of true prevalence and can misinform public health policy making. It is therefore essential to account for imperfect diagnostic accuracy properly to estimate prevalence. One strategy is to incorporate prior information about sensitivity and specificity9 in a Bayesian framework14–16 in order to estimate depression prevalence.
The objectives of this study were to (1) estimate prevalence of major depression in Europe, taking into account the imperfect diagnostic accuracy of the PHQ-8 screening tool, (2) assess differences in prevalence between countries and (3) compare results with those from a previous EHIS study, which assumed perfect PHQ-8 diagnostic accuracy for identifying major depression.
This was a cross-sectional study that analysed data from the second wave of the EHIS. We included data from 27 countries where the PHQ-8 was administered. The study is reported in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guideline.17
The EHIS is a population-based study, which provides harmonised data on health status, healthcare use and health determinants as well as socioeconomic background variables across European countries. The EHIS is nationally organised and conducted every 5 years. The second wave took place between 2013 and 2015 in all European Union member states, Iceland and Norway.18 It included 30 countries, of which 27 administered the PHQ-8 and were included in the present study. Study planning, sampling, data collection, processing and submission were conducted following guidelines from Eurostat.19
All EHIS study material was translated following a standardised translation protocol, including at least two translators with the target language as their mother tongue. In 21 countries, the study material was pretested. Data were collected in different countries by face-to-face interviews, telephone interviews, postal or online questionnaires or a combination of these methods. Data were mainly collected in 2014 and took 8 months on average for each country.18
The EHIS targets the European population aged 15 years and older living in private households. Depending on the country, one of three types of sampling frame was used: population registers, dwelling registers and population censuses. The most common sampling design was a two-stage or three-stage stratified or systematic (cluster) sampling design with the individual being the ultimate sampling unit. The overall sample size in each country was determined to achieve a precision requirement of <1 percentage point error for the most critical variable in the survey. Out of 27 countries included in this study, for 9 the minimum effective sample size was not available, for 5 the target was not achieved and for 13 the minimum effective sample size was obtained. The unit non-response rate by country ranged from 16% to 70% with the highest rate of non-response when only self-administered questionnaires were collected.
Weighting factors for each individual were calculated to model the eligible population of each country, to account for design features of the survey and to reduce bias caused by non-response. Details can be found in the EHIS Methodological manual19 (pp 156–160). Eurostat reported that study guidelines and implementation regulations have been closely followed, resulting in overall sufficient or even good comparability of the data across countries.18
The PHQ-9 consists of nine items representing the nine symptoms included in diagnostic criteria for major depression according to the Diagnostic and Statistical Manual of Mental Disorders, fourth edition.20 Each item is scored using a 4-category Likert scale that reflects frequency during the past 2 weeks (0–3 points; not at all to nearly every day). The PHQ-9 has been widely adopted in clinical research and practice.21 The PHQ-8, which is commonly used in research and was used in the EHIS, omits an item of the PHQ-9 regarding suicidal ideation and self-harm. Total scores for the PHQ-8 can range from 0 to 24. Based on 27 studies included in an individual participant data meta-analysis of the PHQ-8 (6362 participants, 790 major depression cases), the cut-off of ≥10 maximised combined sensitivity and specificity compared with major depression classification based on a semi-structured interview. Sensitivity and specificity were estimated to be 0.86 (95% CI 0.80 to 0.90) and 0.86 (95% CI 0.83 to 0.89), although with considerable heterogeneity across studies.6
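The scoring rule described above can be sketched in a few lines of Python. This is an illustration only: the function names and the example respondent are hypothetical, and returning no score when an item is missing is an assumption that mirrors the exclusion of incomplete questionnaires in this study rather than an official scoring rule.

```python
from typing import Optional, Sequence

PHQ8_CUTOFF = 10  # a sum score >= 10 counts as a positive screen in this analysis


def phq8_sum_score(items: Sequence[Optional[int]]) -> Optional[int]:
    """Return the PHQ-8 sum score (0 to 24), or None if any item is missing."""
    if len(items) != 8:
        raise ValueError("the PHQ-8 has exactly 8 items")
    if any(item is None for item in items):
        return None  # assumption: no sum score is calculated with missing items
    if any(item not in (0, 1, 2, 3) for item in items):
        raise ValueError("each item must be scored 0, 1, 2 or 3")
    return sum(items)


def screens_positive(items: Sequence[Optional[int]]) -> Optional[bool]:
    """True if the sum score reaches the cut-off; None if the score is undefined."""
    score = phq8_sum_score(items)
    return None if score is None else score >= PHQ8_CUTOFF


responses = [2, 1, 1, 2, 1, 1, 1, 1]  # hypothetical respondent
print(phq8_sum_score(responses), screens_positive(responses))  # 10 True
```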
We used a Bayesian Latent Class Model to estimate major depression prevalence based on the PHQ-8.15 22 The two latent (unobserved) classes represent disease status (major depressive disorder (MDD), no MDD). Based on observed PHQ-8 test status and prior information on PHQ-8 test characteristics, we estimate the probability of class membership, which is depression prevalence.
Replicating Arias-de la Torre et al,5 PHQ-8 scores ≥10 were considered positive. Prior information on sensitivity and specificity from a comprehensive meta-analysis6 was employed probabilistically as prior distributions. The model was fitted using Markov Chain Monte Carlo methods.23 Complete statistical analysis methods are described in online supplemental appendix 1.
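Given the differences in sampling and data collection, we considered the EHIS as 27 independent studies (one per country). In any country $i$, the observed number of test positives $y_i$ out of the $n_i$ tested individuals was assumed to follow a binomial distribution with country-specific probability of a positive test $p_i$:
$$y_i \sim \mathrm{Binomial}(n_i, p_i)$$
where $p_i$ can be expressed as the sum of the probabilities for true positive ($TP_i$) and false positive tests ($FP_i$). These can be expressed in terms of prevalence $\pi_i$, sensitivity $Se_i$ and specificity $Sp_i$:
$$p_i = TP_i + FP_i = \pi_i \, Se_i + (1 - \pi_i)(1 - Sp_i)$$
We used a joint multivariate normal prior on the logit of $Se_i$ and $Sp_i$ 24:
$$\begin{pmatrix} \operatorname{logit}(Se_i) \\ \operatorname{logit}(Sp_i) \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_{Se} \\ \mu_{Sp} \end{pmatrix}, \begin{pmatrix} \tau_{Se}^2 & \rho \, \tau_{Se} \tau_{Sp} \\ \rho \, \tau_{Se} \tau_{Sp} & \tau_{Sp}^2 \end{pmatrix} \right)$$
and a beta prior for $\pi_i$:
$$\pi_i \sim \mathrm{Beta}(\alpha, \beta)$$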
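The relation between prevalence and the probability of a positive test can be illustrated with a minimal Python sketch. The function name is hypothetical. The sensitivity and specificity of 0.86 are the meta-analytic point estimates quoted above, used here without their uncertainty, and the prevalence values are chosen only for illustration.

```python
def positive_test_probability(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Probability that a randomly selected person screens positive."""
    true_positive = prevalence * sensitivity
    false_positive = (1.0 - prevalence) * (1.0 - specificity)
    return true_positive + false_positive


# Point estimates for the PHQ-8 cut-off of >= 10 (uncertainty ignored here)
SENSITIVITY = 0.86
SPECIFICITY = 0.86

for prevalence in (0.0, 0.021, 0.064):
    p = positive_test_probability(prevalence, SENSITIVITY, SPECIFICITY)
    print(f"prevalence {prevalence:.3f} -> expected positive-test rate {p:.3f}")
# prevalence 0.000 -> expected positive-test rate 0.140
# prevalence 0.021 -> expected positive-test rate 0.155
# prevalence 0.064 -> expected positive-test rate 0.186
```

Under these point estimates, even a population without any depression would be expected to produce 14% positive tests, because every positive test would then be a false positive. This is why the positive-test rate cannot be read directly as prevalence.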
Prior specification and model estimation
Priors represent existing knowledge about model parameters. Prior information about sensitivity and specificity of the PHQ-8 was derived from a recent comprehensive individual participant data meta-analysis.6 We modelled the prior distribution from estimates of mean logit-sensitivity $\mu_{Se}$, mean logit-specificity $\mu_{Sp}$, between-study variances $\tau_{Se}^2$ and $\tau_{Sp}^2$, and between-study correlation $\rho$.25 For the cut-off of ≥10, the prior distribution was parameterised with the corresponding estimates from this meta-analysis.
In our main analysis, we did not include informative prior information on depression prevalence to maintain comparability to the analysis reported by Arias-de la Torre et al 5 and used a uniform prior on prevalence, which corresponds to a beta distribution with both parameters equal to 1:
$$\pi_i \sim \mathrm{Beta}(1, 1)$$
To investigate the appropriateness of the priors derived above and to inform prior sensitivity analysis, we performed prior predictive checks (see online supplemental appendix 2).
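A prior predictive check of this kind can be sketched as follows. This is not the authors' code. The prior means are taken from the reported point estimates (0.86 for both sensitivity and specificity), whereas the standard deviations on the logit scale and the correlation are hypothetical placeholders, because the actual prior parameters are not reproduced in this text. The sketch also pools all countries and ignores binomial sampling noise, which is small at this sample size.

```python
import math
import random


def inv_logit(x: float) -> float:
    """Map a logit-scale value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    """Map a probability to the logit scale."""
    return math.log(p / (1.0 - p))


def draw_accuracy(rng, mu_se, mu_sp, sd_se, sd_sp, rho):
    """Draw (sensitivity, specificity) from a bivariate normal on the logit scale."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    logit_se = mu_se + sd_se * z1
    logit_sp = mu_sp + sd_sp * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return inv_logit(logit_se), inv_logit(logit_sp)


def prior_predictive_positive_rates(n_draws, rng, **prior):
    """Simulate positive-test rates implied by the priors alone."""
    rates = []
    for _ in range(n_draws):
        se, sp = draw_accuracy(rng, **prior)
        prevalence = rng.random()  # uniform prior on prevalence
        rates.append(prevalence * se + (1.0 - prevalence) * (1.0 - sp))
    return rates


# Means from the reported point estimates; SDs and correlation are HYPOTHETICAL.
PRIOR = dict(mu_se=logit(0.86), mu_sp=logit(0.86), sd_se=0.5, sd_sp=0.5, rho=0.0)

rng = random.Random(1)
rates = prior_predictive_positive_rates(20_000, rng, **PRIOR)
observed = 15_757 / 258_888  # crude share of PHQ-8 scores >= 10 in the EHIS sample
share_below = sum(rate <= observed for rate in rates) / len(rates)
print(f"share of prior predictive draws at or below the observed rate: {share_below:.4f}")
```

A very small share indicates that positive-test rates as low as the observed one are hard to reconcile with the priors, which is the kind of mismatch that motivates a prior sensitivity analysis.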
All models were fitted in Stan26 using Markov Chain Monte Carlo sampling (4 chains, 5000 iterations, 2500 warm-up iterations). We examined trace plots, $\hat{R}$ values, effective sample size and autocorrelation plots to assess model convergence.23 We performed posterior predictive checks to investigate whether the model was adequate to describe the observed data. Code to fit the model can be accessed at Open Science Framework (https://osf.io/w7fj2).
We assessed the joint posterior distribution of the model parameters (prevalence, sensitivity and specificity). We reported posterior means and 95% CrIs for each country and compared the marginal and joint posterior distributions of sensitivity and specificity to the respective prior distributions.
Based on the joint posterior distribution, we estimated the expected numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) for each country. We estimated the ratio between true prevalence and positive test frequency as well as the posterior probability that the positive test frequency overestimates the true prevalence.
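For readers without access to Stan, the posterior of prevalence for a single country can be approximated with plain importance sampling. The sketch below is not the authors' Stan model and is offered only as an illustration under stated assumptions: the counts are hypothetical, the prior standard deviations and correlation are hypothetical placeholders, and the prior on prevalence is uniform. Sensitivity and specificity are drawn from the prior, the positive-test probability is drawn from its normalised binomial likelihood, and prevalence is obtained by inverting the mixture formula. Draws that imply a prevalence outside 0 to 1 receive weight zero, and the remaining draws are weighted by 1/|Se + Sp − 1|.

```python
import math
import random


def inv_logit(x: float) -> float:
    """Map a logit-scale value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    """Map a probability to the logit scale."""
    return math.log(p / (1.0 - p))


def posterior_prevalence_draws(y, n, prior, n_draws, rng):
    """Return weighted posterior draws (prevalence, weight) for one country."""
    mu_se, mu_sp, sd_se, sd_sp, rho = prior
    draws = []
    for _ in range(n_draws):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        se = inv_logit(mu_se + sd_se * z1)
        sp = inv_logit(mu_sp + sd_sp * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))
        p = rng.betavariate(y + 1, n - y + 1)  # normalised binomial likelihood
        youden = se + sp - 1.0
        if youden == 0.0:
            continue  # prevalence is not identifiable for this draw
        prevalence = (p - (1.0 - sp)) / youden
        if 0.0 <= prevalence <= 1.0:
            draws.append((prevalence, 1.0 / abs(youden)))
    return draws


def weighted_mean(draws):
    total = sum(weight for _, weight in draws)
    return sum(value * weight for value, weight in draws) / total


def weighted_quantile(draws, q):
    ordered = sorted(draws)
    total = sum(weight for _, weight in ordered)
    cumulative = 0.0
    for value, weight in ordered:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return ordered[-1][0]


# Means from the reported point estimates; SDs and correlation are HYPOTHETICAL.
PRIOR = (logit(0.86), logit(0.86), 0.5, 0.5, 0.0)

# Hypothetical country: 640 positive screens among 10 000 respondents.
draws = posterior_prevalence_draws(y=640, n=10_000, prior=PRIOR,
                                   n_draws=50_000, rng=random.Random(2))
mean = weighted_mean(draws)
lower, upper = weighted_quantile(draws, 0.025), weighted_quantile(draws, 0.975)
print(f"posterior mean {mean:.3f}, 95% interval {lower:.3f} to {upper:.3f}")
```

Only a small fraction of proposals is retained, because most prior draws of specificity imply more false positives than the total number of positive tests observed. A dedicated sampler such as the one in Stan handles this far more efficiently.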
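As an illustration, the four test outcomes and the ratio of prevalence to positive-test frequency follow directly from a single set of parameter values. In a Bayesian analysis, these quantities would be computed for every posterior draw and then summarised. The function names and the example values below are hypothetical.

```python
from typing import Dict


def expected_cells(prevalence: float, sensitivity: float, specificity: float) -> Dict[str, float]:
    """Expected shares of true/false positives and negatives among all tests."""
    return {
        "TP": prevalence * sensitivity,
        "FN": prevalence * (1.0 - sensitivity),
        "FP": (1.0 - prevalence) * (1.0 - specificity),
        "TN": (1.0 - prevalence) * specificity,
    }


def prevalence_to_positive_ratio(cells: Dict[str, float]) -> float:
    """Ratio of true prevalence (TP + FN) to positive-test frequency (TP + FP)."""
    return (cells["TP"] + cells["FN"]) / (cells["TP"] + cells["FP"])


# One hypothetical parameter draw
cells = expected_cells(prevalence=0.02, sensitivity=0.80, specificity=0.95)
ratio = prevalence_to_positive_ratio(cells)
print({name: round(share, 3) for name, share in cells.items()})
print(f"ratio of prevalence to positive tests: {ratio:.2f}")
```

A ratio below 1 means that the positive-test frequency overestimates prevalence for that draw. The share of draws with a ratio below 1 estimates the posterior probability of overestimation.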
Finally, we assessed major depression prevalence differences between countries by inspecting the respective posterior distributions and reported the mean and 95% CrI of these differences for all pairwise comparisons. We also assessed, for each pair of countries, the posterior probability that the prevalence difference between both was greater/smaller than 0. A posterior probability >95% was considered as strong evidence for an actual prevalence difference between countries.
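A pairwise comparison of this kind can be sketched from two sets of posterior draws. The sketch assumes equally many, equally weighted draws per country, such as those returned by an MCMC sampler. The function name is hypothetical, and the beta-distributed draws in the example are stand-ins for real posterior samples.

```python
import random
import statistics
from typing import Dict, Sequence


def compare_prevalence(draws_a: Sequence[float], draws_b: Sequence[float]) -> Dict[str, float]:
    """Summarise the posterior of the prevalence difference (country A minus B)."""
    if len(draws_a) != len(draws_b) or len(draws_a) < 2:
        raise ValueError("need two equally long sets of at least two draws")
    diffs = [a - b for a, b in zip(draws_a, draws_b)]
    cuts = statistics.quantiles(diffs, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return {
        "mean": statistics.fmean(diffs),
        "lower": cuts[0],   # 2.5% quantile
        "upper": cuts[-1],  # 97.5% quantile
        "pr_a_greater": sum(d > 0 for d in diffs) / len(diffs),
    }


# Hypothetical posterior draws for two countries
rng = random.Random(3)
country_a = [rng.betavariate(2, 60) for _ in range(4000)]
country_b = [rng.betavariate(1, 120) for _ in range(4000)]
result = compare_prevalence(country_a, country_b)
print({name: round(value, 3) for name, value in result.items()})
```

Under the criterion stated above, a value of `pr_a_greater` above 0.95 or below 0.05 would count as strong evidence for a prevalence difference between the two countries.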
To investigate the robustness of our analysis against different priors, we investigated the impact of prior adjustments derived from prior predictive checks on the posterior distributions. See online supplemental appendix 1 for more details.
The EHIS microdata contained 316 333 observations from 30 countries. We excluded 39 608 observations from Belgium, Spain and the Netherlands, where the PHQ-8 was not administered, 10 139 observations, where a proxy answered the survey instead of the selected person and 7694 observations, where the PHQ-8 sum score could not be calculated due to missing items. A participant flow chart and detailed information of PHQ-8 item non-response are provided in online supplemental appendices 3 and 4.
Overall, 258 888 participants were included in the study, of which 15 757 (6.1%) had a positive depression screening test with a PHQ-8 score of ≥10. Table 1 shows demographic characteristics of the sample (weighted, crude numbers are reported in online supplemental appendix 5). Table 2 shows the sample size as well as the absolute and relative number of positive tests (PHQ-8 score ≥10) for each country (weighted, crude numbers are reported in online supplemental appendix 6). The weighted mean PHQ-8 score was 2.8 (SD=3.8).
Model sampling performed well without any indication of problems. Empirical indicators such as trace plots, autocorrelation plots, $\hat{R}$ (all parameters <1.002) and effective sample size (2225–22 148) indicated appropriate exploration and convergence of the posterior distribution. Posterior predictive checks indicated that the fitted model could generate the observed data well (see online supplemental appendix 7).
Figure 1 shows the posterior prevalence of major depression for each of the included countries. We estimated an overall prevalence of 2.1% (95% CrI 1.0% to 3.8%), which is considerably lower and less precise than the prevalence estimate of 6.4% (95% CI 6.2% to 6.5%) reported by Arias-de la Torre et al.5
We observed this pattern in every country. Although 10.3% of the participants in Iceland had a PHQ-8 score ≥10, the mean posterior prevalence estimate was only 4.2%. The 95% CrI indicates that a wide range of prevalence values from 0.2% to 11.3% would be in line with prior information on PHQ-8 test characteristics and the observed data.
Table 3 reports the posterior means and 95% CrIs for the percentage of TP, FP, TN and FN PHQ-8 results. For example, we expect in Austria 95.3% of all tests to be TN, 0.8% TP, 0.4% FN and 3.5% FP. Hence, the ratio of prevalence (TP+FN) to positive tests (TP+FP) is 0.27 (95% CrI 0.01 to 0.90). The posterior probability that the true prevalence was smaller than the positive test frequency was 98.3% for Austria, which was similar across all countries.
We estimated the largest prevalence difference between Iceland and the Czech Republic with an estimated difference of 3.6% (95% CrI −0.6% to 11.3%, Pr=0.079). For no pairwise comparison can we determine with strong evidence (posterior probability >95%) which country has the lower depression prevalence (see online supplemental appendix 8).
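As a rough plausibility check, the ratio can be approximated from the rounded posterior means reported for Austria. The small difference from the reported 0.27 presumably arises because the reported value is a posterior mean of the ratio rather than a ratio of posterior means, and because the inputs are rounded. This explanation is an assumption, not a statement from the analysis.

```latex
\frac{TP + FN}{TP + FP} \approx \frac{0.8\% + 0.4\%}{0.8\% + 3.5\%} = \frac{1.2}{4.3} \approx 0.28,
\qquad
\frac{FP}{TP + FP} \approx \frac{3.5\%}{4.3\%} \approx 0.81
```

On these rounded values, roughly four in five positive tests in Austria would be expected to be false positives.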
Analysis of sensitivity and specificity
Prior predictive checks indicated that the observed numbers of positive tests in the EHIS were unlikely given the prior information on specificity (see online supplemental appendix 2). The posterior distribution of specificity indicated this as well (see online supplemental appendix 9). Specificity was estimated to be at least 0.90, whereas the prior distribution suggested a specificity between 0.70 and 0.90. The posterior distribution had a greater mean and smaller variance compared with the prior distribution, suggesting that specificity in the EHIS was greater than in the available diagnostic studies that we used to establish our prior distribution. Online supplemental appendix 8 explains in more detail how our model holds information about specificity even though true depression status is not available, and shows the joint prior and posterior distributions of sensitivity and specificity, as well as their 95% CrIs.
Prior sensitivity analysis
We report results of a prior sensitivity analysis in online supplemental appendix 10. Compared with our main analysis, imposing a liberal assumption that depression prevalence is likely >0.5% and <25.8% resulted in minimally higher prevalence estimates with comparable uncertainty. We found a more pronounced effect when we additionally adjusted the prior on specificity, leading to higher prevalence estimates compared with our initial analysis. Nonetheless, the posterior distribution suggested that actual prevalence was lower than the observed positive test frequency. Even in a purely hypothetical scenario in which we assumed unrealistically high a priori sensitivity and specificity of 95% and 97%, respectively, with high precision of the prior distributions, the uncertainty of the estimates remained rather large and observed positive test frequencies considerably overestimated the prevalence.
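The direction of the last result can be checked with the classical Rogan-Gladen point correction, which inverts the relation between prevalence and positive-test probability for fixed sensitivity and specificity. This is a simplified stand-in, not the Bayesian model used in this study, and it ignores all uncertainty. The function name is hypothetical, and 6.4% is the previously reported weighted positive-test frequency.

```python
def rogan_gladen(positive_rate: float, sensitivity: float, specificity: float) -> float:
    """Point estimate of prevalence corrected for imperfect test accuracy."""
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0.0:
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (positive_rate + specificity - 1.0) / denominator
    return min(1.0, max(0.0, estimate))  # truncate to the valid range


# Hypothetical, unrealistically high accuracy (95% sensitivity, 97% specificity)
print(round(rogan_gladen(0.064, 0.95, 0.97), 3))  # 0.037

# Meta-analytic point estimates (0.86 and 0.86): the raw estimate is negative
print(rogan_gladen(0.064, 0.86, 0.86))  # 0.0
```

Even under the optimistic accuracy assumption, the corrected point estimate of about 3.7% is well below the positive-test frequency of 6.4%. With the meta-analytic point estimates, the raw correction is negative and is truncated at zero, which reflects the mismatch between prior and observed specificity described above.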
We incorporated the best available evidence on the diagnostic accuracy of the PHQ-8, using a cut-off of ≥10, to estimate major depression prevalence in Europe using data from the EHIS survey. Our main findings were that major depression prevalence across Europe was 2.1% (95% CrI 1.0% to 3.8%), that accounting for the imperfect diagnostic accuracy resulted in insufficient power to establish prevalence differences between countries and that previous prevalence estimates from the EHIS are likely overestimates.
Our depression prevalence estimates are in line with studies reporting depression prevalence on the basis of fully structured interviews, for example, 3.3% in the UK27 and 2.0% in the Netherlands.28 In contrast, a survey in the Czech Republic using the Mini-International Neuropsychiatric Interview (MINI) reported depression prevalence of 4.0%, well above our estimate of 0.6%.29 However, comparisons of these estimates need to consider that semi-structured and fully structured interviews are also susceptible to measurement error.
The imprecision of the PHQ-8 as a diagnostic tool resulted in large uncertainty about depression prevalence. Hence, power to detect prevalence differences between countries was small. Despite the number of positive tests in Iceland (10.3%) being almost 4 times larger than in the Czech Republic (2.6%), there was still a 7.9% posterior probability that actual prevalence of major depression in Iceland is smaller than in the Czech Republic. This result is in contrast to the conclusion from Arias-de la Torre et al 5 that prevalence varies substantially and statistically significantly across European countries. Furthermore, we show that the relative frequency of positive tests as reported by Arias-de la Torre et al 5 almost certainly overestimates depression prevalence, even if we assume sensitivity and specificity to be substantially higher than previously reported.
Our analysis should not be mistaken for evidence that depression prevalence does not vary across Europe. For example, the posterior probability for depression prevalence being twice as high in Iceland compared with Czech Republic is 84.1%, indicating that the observed data are in line with a broad range of true prevalence differences. Therefore, relying on the PHQ-8 in a population-based survey alone seems insufficient to assess depression prevalence.
We incorporated sensitivity and specificity estimates reported from a large individual participant data meta-analysis6 as the best currently available evidence. Nonetheless, we found a mismatch between prior and posterior information on specificity, suggesting that the PHQ-8 diagnostic accuracy in population-based studies is different.6 None of the primary studies using semi-structured diagnostic interviews were conducted in the general population; the mean PHQ-8 score of all included primary studies was 5.9 (SD=5.6),6 whereas in the EHIS it was 2.8 (SD=3.8). This suggests that the risk of depression among those who screen negative is lower in the EHIS than in diagnostic studies. Furthermore, diagnostic studies had on average only 236 participants and 29 major depression cases, resulting in imprecise and heterogeneous estimates of sensitivity and specificity.6 Direct sampling of sensitivity and specificity from the predictive posterior distribution of the individual participant data meta-analysis6 would enable us to account for uncertainty in the estimated heterogeneity, relax distributional assumptions or model associations between diagnostic accuracy and prevalence. Our approach of constructing priors from the published parameter estimates is less accurate, but can be replicated more easily in independent studies.
More precise information on the diagnostic accuracy of the PHQ-8 in the general population would be needed to obtain more meaningful prevalence estimates. This could potentially be achieved by tailoring diagnostic accuracy priors to the specific populations; however, existing evidence did not even allow assessment of country-specific diagnostic accuracy. Further possibilities include using different cut-off values or the diagnostic algorithm of the PHQ-9. A promising approach is to use a risk model based on the PHQ-8 sum score. To date, participants with PHQ-8 scores of 0 and 9 both constitute negative screens and are treated as equivalent using the standard cut-off of ≥10, although they apparently have very different probabilities of major depression. Such a risk model is not available yet for the PHQ-8.
A limitation of the present analysis is that we used a uniform prior for prevalence in the main analysis to maintain comparability with the results of Arias-de la Torre et al.5 Furthermore, one should be aware that our prevalence estimates did not reflect any previous knowledge on major depression prevalence across Europe.
More precise prevalence estimates could be achieved with different strategies. One would be to update prior information on diagnostic accuracy with study-specific information on diagnostic accuracy. Such information could be obtained using a two-stage approach, where a semi-structured clinical interview is conducted in a subsample of the survey participants.7 9 However, an appropriate sampling strategy and a sufficient number of interviews are vital to obtain precise and valid diagnostic accuracy estimates. Further strategies include incorporating auxiliary data, for example, from health insurance records,30 or systematic evidence synthesis of all prevalence studies using different assessment tools such as depression screeners and fully and semi-structured interviews.
Our analysis indicates that depression screening tools like the PHQ-8 should not be analysed or interpreted as if they were equivalent to diagnostic interviews in population-based studies. One should therefore not mistake the positive test frequency as a measure of prevalence. Rather, if a screening tool is to be used to attempt to estimate prevalence, appropriate statistical methods must be used to account for the less-than-perfect diagnostic accuracy of depression screening tools, even when sensitivity and specificity appear to be high. Our results call for the development of methods to estimate the probability of major depression more precisely from the PHQ-8 and other depression screening tools, for example, by risk models that are based on the sum score.
Policy makers should be aware that our analysis suggests that fewer people in Europe suffer from major depressive disorder and that differences between countries are likely smaller than previously reported.5 Public health policy making need not necessarily rely on depression prevalence, but previous prevalence estimates are misleading. Evidence from different sources must be weighed carefully.
Patient consent for publication
This specific study is a secondary analysis of a public and anonymised dataset, which had obtained ethics approval and therefore required no additional ethics approval. The EHIS microdata is available at institutional level from Eurostat. All protocols for conducting the survey for data collection are available on the official Eurostat website at: EUR‑Lex‑02008R1338–20210101-EN‑EUR‑Lex. Participants gave informed consent to participate in the study before taking part.
Contributors FF and PK conceived the study. FF and DZ performed data analysis. FF, DZ, GR, BL, AB, BT, MR and PK contributed to study design and interpretation. FF drafted the manuscript. DZ, GR, BL, AB, BT, MR and PK provided critical reviews and approved the final manuscript. FF as the guarantor of the study accepts full responsibility for the work and the conduct of the study, had access to the data, and controlled the decision to publish.
Funding BL was supported by a Fonds de recherche du Québec—Santé (FRQS) Postdoctoral Training Fellowship, AB by a FRQS researcher salary award and BT by a Tier 1 Canada Research Chair, all outside of the present work.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.