Review
A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models
Introduction
Clinical risk prediction models are ubiquitous in many medical domains. These models aim to predict a clinically relevant outcome using person-level information. The traditional approach to developing these models involves regression models, for example, logistic regression (LR) to predict disease presence (diagnosis) or disease outcomes (prognosis) [1]. Machine learning (ML) algorithms are gaining popularity as an alternative approach for prediction and classification problems. ML methods include artificial neural networks, support vector machines, and random forests [2]. Although ML methods have been used sporadically for clinical prediction for some time [3], [4], the growing availability of increasingly large and rich data sets, such as electronic health record data, has reignited interest in exploiting these methods [5], [6], [7].
Definitions of what constitutes ML and the differences with statistical modeling have been discussed at length in the literature [8], yet the distinction is not clear-cut [9]. The seminal reference on this issue is Breiman's review of the “two cultures” [8]. Breiman contrasts theory-based models such as regression with empirical algorithms such as decision trees, artificial neural networks, support vector machines, or random forests. A useful definition of ML is that it focuses on models that directly and automatically learn from data [10]. By contrast, regression models are based on theory and assumptions, and benefit from human intervention and subject knowledge for model specification. For example, ML performs modeling more automatically than regression regarding the inclusion of nonlinear associations and interaction terms [11]. To do so, ML algorithms are often highly flexible and require penalization to avoid overfitting [12]. Some researchers describe the distinction between statistical modeling and ML as a continuum [5]. Other researchers label any method that deviates from basic regression models as ML [13], such as penalized regression (e.g., LASSO, elastic net) or generalized additive models (GAM). Under the “automatic learning from data” definition, these methods do not belong to ML, and we did not classify them as ML in this study.
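To make the point about model specification concrete, the following minimal sketch fits LR twice to a made-up XOR-style data set: once with main effects only, and once with an analyst-supplied interaction term. The data, the function names (`fit_logistic`, `predict`), and the plain gradient-descent fitting are illustrative assumptions, not part of the review; the sketch shows only the regression side of the contrast, namely that LR captures an interaction only when it is explicitly specified.

```python
import math

def fit_logistic(X, y, lr=0.5, steps=5000):
    """Fit logistic regression by plain gradient descent (illustrative only).

    Returns the weights with the intercept first.
    """
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def predict(w, xi):
    """Predicted probability for one observation."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy data: the outcome occurs when exactly one of two
# binary predictors is present (a pure interaction, no main effects).
X_main = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# Main-effects model: the analyst did not specify the interaction.
w_main = fit_logistic(X_main, y)
p_main = [predict(w_main, xi) for xi in X_main]

# Same model with an analyst-specified product term x1 * x2.
X_int = [[x1, x2, x1 * x2] for x1, x2 in X_main]
w_int = fit_logistic(X_int, y)
p_int = [predict(w_int, xi) for xi in X_int]

print("main effects only:", [round(p, 3) for p in p_main])
print("with interaction: ", [round(p, 3) for p in p_int])
```

In this toy setting the main-effects model cannot move away from a predicted risk of 0.5 for every pattern, whereas the model with the product term ranks all four patterns correctly. A flexible algorithm such as a tree could find the same structure without the term being supplied, which is the sense of "automatic" used above.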
Owing to its flexibility, ML is claimed to perform better than traditional statistical modeling and to better handle a larger number of potential predictors [5], [6], [7], [12], [14], [15], [16]. However, recent research suggested that ML requires more data than LR, which contradicts the above claim [17]. Furthermore, ML models are typically assessed in terms of discrimination performance (e.g., accuracy, area under the receiver operating characteristic [ROC] curve [AUC]), while the reliability of risk predictions (calibration) is often not assessed [18]. The claim of improved performance in clinical prediction is therefore not established.
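The difference between discrimination and calibration can be illustrated with a short sketch. It computes the AUC as the proportion of event/non-event pairs that are ranked correctly, and the observed-to-expected (O/E) ratio as one simple summary of calibration-in-the-large. The data are made up, and the helper names (`auc`, `observed_to_expected`) are assumptions for illustration; the review itself does not prescribe this code.

```python
from itertools import product

def auc(y_true, y_prob):
    """Proportion of (event, non-event) pairs in which the event has the
    higher predicted risk; ties count as one half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a, b in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))

def observed_to_expected(y_true, y_prob):
    """Observed number of events divided by the sum of predicted risks.
    A value near 1 suggests calibration-in-the-large; above 1 means
    risks are underestimated on average."""
    return sum(y_true) / sum(y_prob)

# Hypothetical outcomes and predicted risks.
y = [0, 0, 0, 1, 0, 1, 1, 1]
risk = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

# Halving every risk keeps the ranking, and hence the AUC, unchanged,
# but the predictions now underestimate the event rate.
shrunk = [p / 2 for p in risk]

print("AUC original:", auc(y, risk), " O/E:", round(observed_to_expected(y, risk), 3))
print("AUC shrunk:  ", auc(y, shrunk), " O/E:", round(observed_to_expected(y, shrunk), 3))
```

Both sets of predictions have the same AUC, yet the second systematically underestimates risk. A comparison that reports only the AUC would not detect this, which is why calibration needs to be assessed separately.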
The primary objective of this study was to compare the performance of LR with ML algorithms for the development of diagnostic or prognostic clinical prediction models for binary outcomes based on clinical data. Secondary objectives were to describe the characteristics of the studies, the type of ML algorithms that were used, the validation process, the modeling aspects of LR and ML, reporting quality, and risk of bias for comparing performance between regression and ML [19].
Materials and methods
The study was registered with PROSPERO (CRD42018068587). We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement.
Results
Our search identified 927 articles published between 1/2016 and 8/2017, of which 802 studies were excluded based on title or abstract (Fig. 1). Fifty-four studies were excluded during full-text screening. Seventy-one studies met inclusion criteria and came from a wide variety of clinical domains, with oncology and cardiovascular medicine as the most common (Table A.3–4) [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46],
Discussion
Our systematic review of studies that compare clinical prediction models using LR and ML yielded the following key findings. Reporting of methodology and findings was very often incomplete and unclear; model validation procedures were still often poor. Calibration of risk predictions was seldom examined, and AUC performance of LR and ML was on average no different when comparisons had low risk of bias. The latter finding is in line with the claim that traditional approaches often perform
Acknowledgments
This work was supported by the Research Foundation–Flanders (FWO) [grant G0B4716N]; Internal Funds KU Leuven [grant C24/15/037]; Cancer Research UK [grant 5529/A16895]; the NIHR Biomedical Research Centre, Oxford, UK. The funding sources had no role in the conception, design, data collection, analysis, or reporting of this study.
References (113)
Machine learning for medical diagnosis: history, state of the art and perspective
Artif Intell Med
(2001)
The use of artificial neural networks in decision support in cancer: a systematic review
Neural Netw
(2006)
A calibration hierarchy for risk models was defined: from utopia to empirical data
J Clin Epidemiol
(2016)
Internal validation of predictive models: efficiency of some procedures for logistic regression analysis
J Clin Epidemiol
(2001)
Electronic health record phenotyping improves detection and screening of type 2 diabetes in the general United States population: a cross-sectional, unselected, retrospective study
J Biomed Inform
(2016)
How can machine-learning methods assist in virtual screening for hyperuricemia? A healthcare machine-learning approach
J Biomed Inform
(2016)
Falling in the elderly: do statistical models matter for performance criteria of fall prediction? Results from two large population-based studies
Eur J Intern Med
(2016)
Developing artificial neural network models to predict functioning one year after traumatic spinal cord injury
Arch Phys Med Rehabil
(2016)
Development of a web-based liver cancer prediction model for type II diabetes patients by using an artificial neural network
Comput Methods Programs Biomed
(2016)
The use of machine learning for the identification of peripheral artery disease and future mortality risk
J Vasc Surg
(2016)
Modern modeling techniques had limited external validity in predicting mortality from traumatic brain injury
J Clin Epidemiol
Predicting distant failure in early stage NSCLC treated with SBRT using clinical parameters Predicting distant failure in lung SBRT
Radiother Oncol
Validating the usefulness of the “random forests” classifier to diagnose early glaucoma with optical coherence tomography
Am J Ophthalmol
Designing predictive models for beta-lactam allergy using the drug allergy and hypersensitivity database
J Allergy Clin Immunol Pract
Normal tissue complication probability (NTCP) modelling of severe acute mucositis using a novel oral mucosal surface organ at risk
Clin Oncol
Predicting risk for portal vein thrombosis in acute pancreatitis patients: a comparison of radical basis function artificial neural network and logistic regression models
J Crit Care
Artificial neural networks predict the incidence of portosplenomesenteric venous thrombosis in patients with acute pancreatitis
J Thromb Haemost
Predicting the incidence of portosplenomesenteric vein thrombosis in patients with acute pancreatitis using classification and regression tree algorithm
J Crit Care
Classification of suicide attempters in schizophrenia using sociocultural and clinical features: a machine learning approach
Gen Hosp Psychiatry
Predicting return visits to the emergency department for pediatric patients: applying supervised learning techniques to the Taiwan National Health Insurance Research Database
Comput Methods Programs Biomed
Subgroup identification of early preterm birth (ePTB): informing a future prospective enrichment clinical trial design
BMC Pregnancy Childbirth
Different medical data mining approaches based prediction of ischemic stroke
Comput Methods Programs Biomed
Normal tissue complication probability (NTCP) modelling using spatial dose metrics and machine learning methods for severe acute oral mucositis resulting from head and neck radiotherapy
Radiother Oncol
Reporting and interpreting decision curve analysis: a guide for investigators
Eur Urol
Clinical prediction models
The elements of statistical learning: data mining, inference, and prediction
Big data and machine learning in health care
JAMA
Machine learning and prediction in medicine — beyond the peak of inflated expectations
N Engl J Med
Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges
Eur Heart J
Statistical modeling: the two cultures (with comments and a rejoinder by the author)
Stat Sci
Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist
PLoS Med
Machine learning
Machine learning versus statistical modeling
Biom J
Learning about machine learning: the promise and pitfalls of big data and the electronic health record
Circ Cardiovasc Qual Outcomes
Learning from imbalanced data
IEEE Trans Knowl Data Eng
Support vector machines versus logistic regression: improving prospective performance in clinical decision-making
Ultrasound Obstet Gynecol
Scalable and accurate deep learning for electronic health records
NPJ Digit Med
Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view
J Med Internet Res
Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints
BMC Med Res Methodol
Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement
J Clin Epidemiol
A plea for neutral comparison studies in computational sciences
PLoS One
Classifier technology and the illusion of progress
Stat Sci
QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies
Ann Intern Med
Tunability: importance of hyperparameters of machine learning algorithms
Quantifying the impact of different approaches for handling continuous predictors on the performance of a prognostic model
Stat Med
The statistical evaluation of medical tests for classification and prediction
Artificial neural networks versus bivariate logistic regression in prediction diagnosis of patients with hypertension and diabetes
Med J Islam Repub Iran
Predicting ventriculoperitoneal shunt infection in children with hydrocephalus using artificial neural network
Childs Nerv Syst
Comparison of predictive models for the early diagnosis of diabetes
Healthc Inform Res
Prediction and detection models for acute kidney injury in hospitalized older adults
BMC Med Inform Decis Mak