Predicting outcomes at the individual patient level: what is the best method?

Objective: When developing prediction models, researchers commonly employ a single model which uses all the available data (end-to-end approach). Alternatively, a similarity-based approach has been previously proposed, in which patients with similar clinical characteristics are first grouped into clusters, then prediction models are developed within each cluster. The potential advantage of the similarity-based approach is that it may better address heterogeneity in patient characteristics. However, it remains unclear whether it improves the overall predictive performance. We illustrate the similarity-based approach using data from people with depression and empirically compare its performance with the end-to-end approach.

Methods: We used primary care data collected in general practices in the UK. Using 31 predefined baseline variables, we aimed to predict the severity of depressive symptoms, measured by Patient Health Questionnaire-9, 60 days after initiation of antidepressant treatment. Following the similarity-based approach, we used k-means to cluster patients based on their baseline characteristics. We derived the optimal number of clusters using the Silhouette coefficient. We used ridge regression to build prediction models in both approaches. To compare the models' performance, we calculated the mean absolute error (MAE) and the coefficient of determination (R²) using bootstrapping.

Results: We analysed data from 16 384 patients. The end-to-end approach resulted in an MAE of 4.64 and R² of 0.20. The best-performing similarity-based model was for four clusters, with MAE of 4.65 and R² of 0.19.

Conclusions: The end-to-end and the similarity-based model yielded comparable performance. Due to its simplicity, the end-to-end approach can be favoured when using demographic and clinical data to build prediction models on pharmacological treatments for depression.


Inclusion criteria
We included patients registered with QResearch with a recorded diagnosis of depression since 1 January 1998. As in other studies, 1-3 we used Read codes to identify cases of depression.
A new Read code diagnosis of depression was considered a new episode of depression when preceded by 12 months of no depression diagnoses and no prescription of antidepressants.
The index date for entry to the cohort was the date of depression diagnosis and patients were followed up for 3 months after the index date.
We considered 12 months without a diagnosis of depression and without a prescription of antidepressants necessary to regard an episode of depression as distinct from any previous one (see the exclusion criteria below). This is because, with treatment, episodes last on average 3 to 6 months and most patients recover within 12 months, 4 and in general practice the long-term course of depression is more favourable than in clinical samples. 5
Episodes of depression were included only if fluoxetine was prescribed within 12 days around the diagnosis of depression (i.e. 6 days before or after the diagnosis), because we were interested in the prognosis of patients treated with fluoxetine. Episodes in which antidepressants were started 6 or more days prior to the index date of depression diagnosis were treated as if the antidepressants had been prescribed for reasons other than depression, and were excluded. Episodes in which antidepressants were prescribed later than 6 days after the index date were treated as if patients were under watchful waiting/active monitoring, and were excluded. [6][7][8] We did not use a specific threshold on a depression scale (e.g. PHQ-9 >5 at baseline) to include participants, as patients in primary care can be treated even with minor symptoms of depression.
BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance placed on this supplemental material which has been supplied by the author(s).

Exclusion criteria
We excluded:
• Episodes of depression preceded by another episode of depression, or by a prescription of antidepressants, in the year before. This is because patients with multiple diagnoses and treatments within a year can be treatment-resistant and would therefore be a different population; 11
• Episodes of depression not associated with a prescription of an antidepressant at baseline (i.e. 6 days before or after the diagnosis);
• Episodes of depression associated with a prescription of more than one antidepressant in the year before (i.e. 365 days before or 6 days after the diagnosis);
• Episodes of depression associated with a prescription of antipsychotics in the year before (i.e. 365 days before or 6 days after the diagnosis);
• Episodes of depression associated with a prescription of mood stabilisers in the year before (i.e. 365 days before or 6 days after the diagnosis);
• Episodes of depression starting within 3 months of delivery (i.e. post-partum depression).
If a patient had multiple recorded episodes of depression, we selected the last episode for that patient. We also excluded patients with a recorded diagnosis of bipolar disorder or schizophrenia spectrum disorder made at any point before the index episode.

Exposure
The primary exposure of interest was the use of fluoxetine. Information was extracted from all prescriptions for fluoxetine issued during the 3-month follow-up. We calculated the duration of each prescription in days by dividing the number of tablets prescribed by the number of tablets to be taken each day. If the information on tablets per day was missing or not sufficiently detailed (expected to be <5% of total prescriptions), we estimated the duration of the prescription based on the number of tablets prescribed, as in previous studies. 3 Patients were classified as continuously exposed to fluoxetine during periods in which there were no gaps of more than 30 days between the end of one prescription and the start of the next (at the beginning of treatment, most antidepressants are prescribed for no more than 28-30 days). Patients were also classified as exposed for the first 30 days after the estimated date of stopping fluoxetine, both to account for any delay in starting the prescription or accumulation of tablets and to attribute outcomes occurring during withdrawal periods to the antidepressant, as done in previous studies.
Condition-specific variables included baseline depression severity (continuous; we took a PHQ-9 score recorded from two weeks before to 6 days after the index diagnosis of depression as the baseline measurement), previous antidepressant use (yes/no), past use of a selective serotonin reuptake inhibitor (SSRI) (yes/no), past use of fluoxetine (yes/no), previous psychotherapy use (yes/no), previous referral to secondary care (yes/no), and childhood maltreatment (yes/no). Note that any previous use of medication would have occurred more than a year before the index date, because episodes in which patients were prescribed antidepressants in the preceding year were excluded.
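The exposure rules described above (prescription duration, the 30-day gap rule, and the 30-day carry-over after stopping) can be illustrated with a minimal sketch. The function names, the tuple layout of a prescription, and the rounding of fractional durations are assumptions made for illustration, not the authors' implementation. Prescriptions with missing tablets-per-day information are not handled, because the source does not specify how their duration was estimated.

```python
from datetime import date, timedelta

MAX_GAP_DAYS = 30     # largest gap still counted as continuous exposure (from the text)
CARRY_OVER_DAYS = 30  # exposure extended after the estimated stop date (from the text)


def prescription_end(start, n_tablets, tablets_per_day):
    """Estimated end date: tablets prescribed divided by tablets taken per day."""
    duration_days = round(n_tablets / tablets_per_day)  # rounding is an assumption
    return start + timedelta(days=duration_days)


def exposure_periods(prescriptions):
    """Merge prescriptions into continuous exposure periods.

    `prescriptions` is a list of (start_date, n_tablets, tablets_per_day) tuples
    (a hypothetical layout). Returns a list of (period_start, period_end) tuples,
    where period_end already includes the carry-over.
    """
    periods = []
    for start, n_tablets, per_day in sorted(prescriptions):
        end = prescription_end(start, n_tablets, per_day)
        if periods and (start - periods[-1][1]).days <= MAX_GAP_DAYS:
            # Gap of at most 30 days: extend the current exposure period.
            periods[-1][1] = max(periods[-1][1], end)
        else:
            # Longer gap (or first prescription): start a new period.
            periods.append([start, end])
    carry = timedelta(days=CARRY_OVER_DAYS)
    return [(s, e + carry) for s, e in periods]


# Illustrative prescriptions only (not study data).
example = [
    (date(2020, 1, 1), 28, 1),
    (date(2020, 2, 10), 28, 1),  # starts 12 days after the first one ends
    (date(2020, 6, 1), 30, 1),   # starts after a gap of more than 30 days
]
periods = exposure_periods(example)
```

In this example the first two prescriptions merge into one exposure period, while the third starts a new one.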

Hyper-parameter tuning of the ridge regression
To tune the regularisation strength hyper-parameter of the ridge regression, i.e. to find the configuration expected to perform best on unseen data, we conducted a grid search over values from 0.0001 to 100. For each candidate value, we performed 10-fold cross-validation on the training data: we randomly split the training data into 10 folds, trained the ridge regression on 9 folds, and tested it on the left-out fold, repeating this until each of the 10 folds had served once as the test fold. We then compared the performance across all candidate values. The hyper-parameter value that yielded the best performance was used to fit the ridge regression on all training samples.
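The tuning procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the source gives only the range of the grid (0.0001 to 100), so the log-spaced candidate values are assumed; MAE is assumed as the selection metric; the closed-form ridge fit with an unpenalised intercept is one common formulation; and the data are synthetic.

```python
import numpy as np


def fit_ridge(X, y, alpha):
    """Closed-form ridge fit on centred data; the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef


def cv_mae(X, y, alpha, n_folds=10, seed=0):
    """Mean absolute error averaged over the held-out folds."""
    rng = np.random.default_rng(seed)  # same seed, so every alpha sees the same folds
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([f for i, f in enumerate(folds) if i != k])
        coef, intercept = fit_ridge(X[train], y[train], alpha)
        pred = X[test] @ coef + intercept
        errors.append(np.mean(np.abs(y[test] - pred)))
    return float(np.mean(errors))


def tune_ridge(X, y, alphas):
    """Select the alpha with the lowest cross-validated MAE, then refit on all data."""
    scores = {a: cv_mae(X, y, a) for a in alphas}
    best = min(scores, key=scores.get)
    return best, fit_ridge(X, y, best)


# Assumed log-spaced grid covering the range stated in the text.
alphas = np.logspace(-4, 2, 7)

# Synthetic training data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=1.0, size=200)

best_alpha, (coef, intercept) = tune_ridge(X, y, alphas)
```

Reusing the same fold assignment for every candidate value keeps the comparison between candidates fair, since differences in the cross-validated error then reflect only the regularisation strength.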

Silhouette coefficient
The Silhouette coefficient is a measure of how similar data points within a cluster are, compared to data points in other clusters. For each sample, it is calculated as $(b - a) / \max(a, b)$, where $a$ is the mean intra-cluster distance and $b$ is the mean nearest-cluster distance, i.e. the mean distance to the nearest cluster that the sample is not a part of. Of note, "distance" in this context is defined as in the k-means algorithm.
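As a minimal sketch of this definition, the per-sample value (b - a) / max(a, b) can be computed and averaged over all samples as shown below. Euclidean distance is assumed here, as in standard k-means; the function names and the toy points are illustrative only, and returning 0 for a sample that is alone in its cluster is a common convention rather than something stated in the source. The sketch requires at least two clusters.

```python
from math import dist  # Euclidean distance (assumed, as in standard k-means)


def silhouette_sample(i, points, labels):
    """Silhouette value (b - a) / max(a, b) for the point at index i."""
    own = labels[i]
    distances_by_cluster = {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j != i:
            distances_by_cluster.setdefault(lab, []).append(dist(points[i], p))
    if own not in distances_by_cluster:
        return 0.0  # convention for a sample that is alone in its cluster
    # a: mean distance to the other members of the sample's own cluster
    a = sum(distances_by_cluster[own]) / len(distances_by_cluster[own])
    # b: mean distance to the members of the nearest other cluster
    b = min(
        sum(d) / len(d)
        for lab, d in distances_by_cluster.items()
        if lab != own
    )
    if max(a, b) == 0:
        return 0.0  # all relevant points coincide
    return (b - a) / max(a, b)


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all samples."""
    n = len(points)
    return sum(silhouette_sample(i, points, labels) for i in range(n)) / n


# Toy example: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_coefficient(points, [0, 0, 1, 1])  # clusters match the pairs
bad = silhouette_coefficient(points, [0, 1, 0, 1])   # clusters cut across the pairs
```

Values close to 1 indicate compact, well-separated clusters, whereas values near or below 0 indicate overlapping or misassigned clusters, which is why the number of clusters with the highest coefficient is preferred.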

Exploring patterns in the identified clusters
After fitting the similarity-based approach, we examined the identified clusters to explore patterns among the patients within them. For each multiply imputed dataset and for each identified cluster, we summarised the baseline characteristics (which were used for the clustering) as well as the outcomes (which were not involved in the clustering). We did not expect clusters to be identical across the 10 multiply imputed datasets; indeed, the optimal number of clusters may differ across imputed datasets. However, we hypothesised that the patients and their characteristics in each cluster would be similar enough across imputed datasets, i.e. that the clustering procedure would not be very sensitive to the multiple imputation procedure.