Assessing violence risk in first-episode psychosis: external validation, updating and net benefit of a prediction tool (OxMIV)

Background Violence perpetration is a key outcome to prevent for an important subgroup of individuals presenting to mental health services, including early intervention in psychosis (EIP) services. Needs and risks are typically assessed without structured methods, although such methods could improve consistency and accuracy. Prediction tools, such as OxMIV (Oxford Mental Illness and Violence tool), could provide a structured risk stratification approach, but require external validation in clinical settings. Objectives We aimed to validate and update OxMIV in first-episode psychosis and consider its benefit as a complement to clinical assessment. Methods A retrospective cohort of individuals assessed in two UK EIP services was included. Electronic health records were used to extract predictors and risk judgements made by assessing clinicians. Outcome data comprised police and healthcare records of violence perpetration in the 12 months post-assessment. Findings Of 1145 individuals presenting to EIP services, 131 (11%) perpetrated violence during the 12-month follow-up. OxMIV showed good discrimination (area under the curve 0.75, 95% CI 0.71 to 0.80). Calibration-in-the-large was also good after updating the model constant. Using a 10% cut-off, sensitivity was 71% (95% CI 63% to 80%), specificity 66% (63% to 69%), positive predictive value 22% (19% to 24%) and negative predictive value 95% (93% to 96%). In contrast, clinical judgement sensitivity was 40% and specificity 89%. Decision curve analysis showed net benefit of OxMIV over comparison approaches. Conclusions OxMIV performed well in this real-world validation, with improved sensitivity compared with unstructured assessments. Clinical implications Structured tools to assess violence risk, such as OxMIV, have potential in first-episode psychosis to support a stratified approach to allocating non-harmful interventions to individuals who may benefit from the largest absolute risk reduction.

TRIPOD checklist (item; description; page reported)

3a. Explain the medical context (including whether diagnostic or prognostic) and rationale for developing or validating the multivariable prediction model, including references to existing models. 5-7
3b. Specify the objectives, including whether the study describes the development or validation of the model or both. 1-7

Source of data
4a. Describe the study design or source of data (e.g., randomized trial, cohort, or registry data), separately for the development and validation data sets, if applicable. 7
4b. Specify the key study dates, including start of accrual; end of accrual; and, if applicable, end of follow-up. 7

Participants
5a. Specify key elements of the study setting (e.g., primary care, secondary care, general population) including number and location of centres. 7
5b. Describe eligibility criteria for participants. 7
5c. Give details of treatments received, if relevant. n/a

Outcome
6a. Clearly define the outcome that is predicted by the prediction model, including how and when assessed. 8
6b. Report any actions to blind assessment of the outcome to be predicted. 8

Predictors
7a. Clearly define all predictors used in developing or validating the multivariable prediction model, including how and when they were measured. 8
7b. Report any actions to blind assessment of predictors for the outcome and other predictors. 8

Sample size
8. Explain how the study size was arrived at. 9

Missing data
9. Describe how missing data were handled (e.g., complete-case analysis, single imputation, multiple imputation) with details of any imputation method.

Parental drug or alcohol use disorder
Parental lifetime diagnosis of drug or alcohol use disorder (definitions as above for personal history).

Yes / No
Parental violent crime (ParentViol)
Parental lifetime conviction for a violent offence (defined as above for personal history). History of incarceration taken as a proxy of violent offending.

Yes / No

Sibling violent crime (SibViol)
Sibling lifetime conviction for a violent offence (defined as above). History of incarceration taken as a proxy of violent offending.

Yes / No

Current episode (Episode)
Inpatient hospital admission or outpatient community patient at point of assessment.

Inpatient / Outpatient

Recent antipsychotic treatment (Antipsych)
Prescribed and taken any antipsychotic drug in the 6 months before assessment.

Yes / No

Recent antidepressant treatment (Antidep)
Prescribed and taken any antidepressant drug in the 6 months before assessment.

Yes / No

Recent dependence treatment (Dependence)
Any pharmacological strategy to treat dependence (e.g. replacement therapy such as methadone) prescribed and taken in the 6 months before assessment.

Yes / No

Personal income*
Low: unemployed and/or inadequate financial situation.

Low / Stable

Choice of violent outcome measure
Officially registered violent outcomes such as arrest or conviction offer an advantage over measures that rely on ongoing contact with mental health services, as this engagement may be disrupted as a consequence of developing the outcome. Official outcomes also avoid limitations of self-report scales. Conviction is the most robust measure; however, there are practical limitations in a psychiatric population. Police-recorded occurrences of an individual having been arrested, charged and/or named a suspect for a violent offence were used, as these objectively capture a significant incident that is stressful, disruptive and stigmatising, and that consumes both police and healthcare resources for potential future care. Relevant offence codes were pre-specified (Supplementary Table 2). Codes for offences pertaining to possession alone of a weapon (i.e. without any associated threats or assaultive behaviour as defined above) were not included, as these can occur secondary to proactive policing measures. To be included, the date of the incident (rather than the date of arrest or charge) must have been within 12 months of the EIP assessment.
It is also well recognised that significant violent incidents may occur which do not lead to police contact. Typically these can occur during acute-phase illness, such as during an inpatient hospital admission, where normal criminal proceedings may not be pursued.
Nevertheless, such incidents are of clinical importance. To avoid missing such outcomes, therefore, police data were supplemented by a focussed review of the EHR over the 12-month period in question to identify any relevant documented incidents. The threshold of severity for this was pre-specified as involving a weapon or documented physical injury.

Clinical judgement comparator
If the summary did not mention risk to others, or referred to risk as "low" (including synonyms such as "absent", "not identified", "nil significant", "no concern"), this was recorded as "not increased". Alternatively, if 1) there was any documentation that risk to others was "moderate" or "high" (including synonyms "elevated", "escalating", "concerning", "acute", "unsafe"), 2) the categorical option in the risk assessment template was marked "risk identified", or 3) there were features within the management plan specifically addressing violence risk (for example documentation of safety advice given to a relative on how to respond in the event of aggression), then risk was recorded as "increased".

Data extraction
If there was insufficient information to code a predictor, this was recorded as "NA". Data were extracted by one researcher (DW) and, for a random sample of 20 individuals, also blindly by a second researcher. Concordance was 90%-100% for all but three predictors (highest education [85%], sibling violent crime [85%] and parental violent crime [75%]). Discordance was around the threshold to code "NA" and was resolvable through discussion.

Missing data
The approach of multiple imputation by chained equations is the recommended approach in current statistical literature, and even allows external validation in cohorts that do not include all the original predictors [2]. Alternatives such as complete-case analysis are recognised as unnecessarily inefficient, as they can disregard a substantial amount of available data and introduce unwanted selection bias because only patients with complete data can be included in the analysis [3]. Further, entirely excluding predictors with missing data from the model would discard information predictive of the outcome. The plausibility of imputed datasets was checked by visually examining the distribution of predictors in observed and imputed data using stacked bar charts. The ability of pooling results from multiple imputations to give unbiased performance estimates and standard errors is well described [4].

Discrimination
Discrimination is the ability of a model to separate (i.e. order correctly) any random pair of individuals with and without the outcome, and was quantified with the AUC (area under the receiver operating characteristic [ROC] curve), which for a binary outcome is identical to the concordance or c-statistic. [5] An AUC of 1 represents perfect discrimination, whereas 0.5 indicates a model that performs no better than chance.
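This pairwise definition can be computed directly. A minimal Python sketch (illustrative only; the study's analyses used R), with tied predictions counted as 0.5:

```python
def auc(y, p):
    """Concordance (c-statistic): the probability that a randomly chosen
    individual with the outcome receives a higher predicted risk than a
    randomly chosen individual without it; ties score 0.5."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    concordant = sum(
        1.0 if pp > pn else 0.5 if pp == pn else 0.0
        for pp in pos for pn in neg
    )
    return concordant / (len(pos) * len(neg))
```

For a binary outcome this pairwise count is identical to the area under the ROC curve, which is why the two terms are used interchangeably above.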

Calibration
Whilst discrimination is important, for a tool to be clinically usable it must also be adequately calibrated in external validation. Calibration is how well predictions match the observed data, i.e. do X% of individuals with a predicted risk of X% develop the outcome? Examining calibration in the current study is especially important, as the event rate was expected to differ from that in the development dataset.
Calibration was examined with:
1) calibration-in-the-large (CITL) [6], which estimates the difference between the mean predicted and mean observed outcomes, where CITL < 0 indicates that predicted probabilities are higher than observed proportions and vice versa;
2) the expected/observed ratio (E/O) [7], related to CITL but more intuitively summarising the ratio of the number predicted to have the outcome to the number observed to have it, where 1 is perfect;
3) the calibration slope [6], where a calibration model is fitted and the slope coefficient indicates the degree of deviation from a perfect slope of 1 (<1 for an over-fitted model with predictions that are too extreme at low and high probabilities, and >1 for an under-fitted model); and
4) visual inspection of calibration plots, with predicted and observed probabilities plotted in decile risk groups.
For overall model performance (encompassing both discrimination and calibration), the Brier score was calculated [5]. This is the mean squared difference between binary outcomes and predicted probabilities, ranging from 0 to 1, with better scores being closer to 0.
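Two of these summary measures follow directly from the outcomes and predicted risks. A minimal Python sketch (illustrative only; the study's analyses used R — CITL and the calibration slope additionally require fitting a logistic recalibration model and are omitted here):

```python
def expected_observed_ratio(y, p):
    """E/O ratio: predicted number of events divided by observed number.
    1 is perfect; >1 means the model over-predicts risk overall."""
    return sum(p) / sum(y)

def brier_score(y, p):
    """Mean squared difference between binary outcomes and predicted
    probabilities; ranges 0 to 1, lower is better."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)
```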

Model updating
A stepwise approach was pre-specified [8]. First, baseline risk was adjusted by updating the model constant, bringing predicted and observed outcomes in line without affecting discrimination or predictor coefficients. Next, the calibration slope was used as a rescaling factor. The improvement in performance was examined to decide whether this rescaled model should be adopted in favour of the model with only the updated intercept (Supplementary

Clinical benefit
Measures of discrimination or calibration alone give limited information on clinical value [9,10]. This was addressed by examining net benefit with decision curve analysis (DCA) [11][12][13], informed by recent clarifications of its interpretation [14,15]. DCA specifies an "exchange rate": the number of false positive predictions that are acceptable to "treat" one true positive. This subjective preference depends on the relative harms of misclassification and is plotted across a range. The net benefits of OxMIV and unstructured clinical judgement were compared to baseline strategies of "treating none" or "treating all", as recommended for interpretation [14]. Intervening for all or no patients, irrespective of model results, may be a reasonable clinical strategy, and so, to justify clinical use, a model must be superior to both. For example, a poorly performing model may offer lower net benefit than offering a non-harmful intervention to all. For consistency and ease of interpretation, for OxMIV, the exchange rate is the same as the cut-off for classifying "increased" risk. Analyses were undertaken with R version 4.1 [16].
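The net benefit calculation underlying DCA is simple: at risk threshold pt, net benefit = TP/n − (FP/n) × pt/(1 − pt), where the odds of the threshold act as the exchange rate between false and true positives. A hedged Python sketch (the study used the rmda package in R; function names here are illustrative):

```python
def net_benefit(y, p, pt):
    """Net benefit of classifying risk >= pt as 'increased': true
    positives per person, minus false positives per person weighted by
    the threshold odds (the exchange rate)."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= pt and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= pt and yi == 0)
    return tp / n - (fp / n) * (pt / (1.0 - pt))

def net_benefit_treat_all(y, pt):
    """Baseline 'treat all' strategy: everyone is classified positive,
    so net benefit depends only on prevalence and the threshold odds."""
    prev = sum(y) / len(y)
    return prev - (1.0 - prev) * (pt / (1.0 - pt))
```

"Treat none" has net benefit 0 at every threshold, which is why a usable model's decision curve must sit above both baselines.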

Model updating and statistical analyses
The OxMIV model is expressed by the equation P = exp(Y) / (1 + exp(Y)), where Y is the linear predictor (the model constant plus the sum of the products of each coefficient and its predictor), converted to the probability scale by the inverse logit link. To update the intercept of the model, linear predictors were first calculated with the intercept term removed, and a logistic model was then fitted with these new linear predictors as the offset term. To update both intercept and slope, a logistic model was fitted using the linear predictors with the intercept removed as the only predictor.
Confidence intervals were calculated with the exact, conservative Clopper-Pearson method for sensitivity/specificity and asymptotic logit intervals for predictive values [17,18].
Confidence intervals were also reported for the incidence of violence in the cohort using the Wilson method [19], as this may inform future work.
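Both interval methods named above can be sketched in a few lines. This pure-Python illustration (the study used R) computes the Wilson score interval in closed form and finds the exact Clopper-Pearson bounds by bisection on the binomial tail probabilities, evaluated in log space so that large n (such as 1145) does not overflow:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a proportion k/n."""
    phat = k / n
    denom = 1.0 + z * z / n
    centre = (phat + z * z / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def clopper_pearson_ci(k, n, alpha=0.05):
    """Exact (conservative) Clopper-Pearson interval, via bisection on
    binomial tail probabilities computed in log space."""
    def log_pmf(i, p):
        return (math.lgamma(n + 1) - math.lgamma(i + 1)
                - math.lgamma(n - i + 1)
                + i * math.log(p) + (n - i) * math.log(1.0 - p))

    def tail_ge(p):  # P(X >= k) under Binomial(n, p), increasing in p
        return sum(math.exp(log_pmf(i, p)) for i in range(k, n + 1))

    def tail_le(p):  # P(X <= k) under Binomial(n, p), decreasing in p
        return sum(math.exp(log_pmf(i, p)) for i in range(0, k + 1))

    def bisect(f, target, increasing):
        lo_p, hi_p = 0.0, 1.0
        for _ in range(60):
            mid = (lo_p + hi_p) / 2
            if (f(mid) < target) == increasing:
                lo_p = mid
            else:
                hi_p = mid
        return (lo_p + hi_p) / 2

    lower = 0.0 if k == 0 else bisect(tail_ge, alpha / 2, True)
    upper = 1.0 if k == n else bisect(tail_le, alpha / 2, False)
    return lower, upper
```

For the cohort incidence reported above (131 of 1145), the Wilson interval is roughly 9.7% to 13.4%.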
Decision curve analysis was undertaken using the rmda package [20]. Missing data were visualised with the packages Hmisc [21], mice [22] and VIM [23], and generalised linear models were fitted to examine associations with missingness of a predictor as a binary dependent variable. The mice package [22] was used for multiple imputation and subsequent pooled analyses, specifying logistic regression for imputation of binary data and a multinomial logit model for imputation of education (a factor with >2 levels).

Public and patient involvement
Efforts were made to test the methodology of this study with a wide range of public and patient representatives. The primary focus of this public and patient involvement was to test the acceptability of the methodology of using routine health record data. Input also assisted in producing the patient-facing documentation, which was limited to a poster communicating the study activity and the mechanism for dissent that was displayed in relevant patient areas.
Over the course of the different stages of design and approval, 17 individuals were involved.
This included carers and individuals with personal experience of EIP and forensic mental health services. There was positivity around the potential clinical applicability of the work.
The importance of the information security safeguards, particularly the interface with police, was highlighted by representatives, who were reassured by the formal frameworks in place.

Missing data
In 950 cases (83%), two or fewer of the 16 predictors were missing, and 471 (41%) were complete cases with no missing data. Examining the patterns of missingness among predictors showed that the family history items tended to be missing together; the most frequent overlap was for sibling violence and parental violence, which were both missing in 32% of cases. Clinically, it is plausible that the missingness of these two family history items depended on the level of detail of the assessment, rather than the missing value itself (i.e. a missing at random [MAR] pattern). Further, it would be expected that family history items would be less complete in older individuals (an observed variable), and more complete in individuals whose reference episode is a hospital admission (also an observed variable), as this typically means a fuller admission summary is documented with mandatory components.
Margin plots showing the distribution of age when each of four predictors (sibling violence, parental violence, parental drug and alcohol use, and education) was either observed or missing indicated that individuals with unobserved predictors tended to be older, in keeping with MAR. Generalised linear models showed that missingness of education and the three family history items was positively associated with age (i.e. older individuals were more likely to have these predictors missing, likely because they were not assessed to the same level of detail as younger persons for these factors), violating the missing completely at random (MCAR) assumption but in keeping with MAR.

Figure: Calibration plot with 10 risk groups, in an imputed dataset where OxMIV performs in the mid-range, for the original model (a) and the model with updated intercept and rescaled coefficients (b). The histogram above the legend represents the distribution density of predicted risks within the dataset.