A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models

doi:10.1016/j.jclinepi.2019.02.004

Journal of Clinical Epidemiology

Volume 110, June 2019, Pages 12-22

https://doi.org/10.1016/j.jclinepi.2019.02.004

Abstract

Objectives

The objective of this study was to compare performance of logistic regression (LR) with machine learning (ML) for clinical prediction modeling in the literature.

Study Design and Setting

We conducted a Medline literature search (1/2016 to 8/2017) and extracted comparisons between LR and ML models for binary outcomes.

Results

We included 71 of 927 studies. The median sample size was 1,250 (range 72–3,994,872), with 19 predictors considered (range 5–563) and eight events per predictor (range 0.3–6,697). The most common ML methods were classification trees, random forests, artificial neural networks, and support vector machines. In 48 (68%) studies, we observed potential bias in the validation procedures. Sixty-four (90%) studies used the area under the receiver operating characteristic curve (AUC) to assess discrimination. Calibration was not addressed in 56 (79%) studies. We identified 282 comparisons between an LR and ML model (AUC range, 0.52–0.99). For 145 comparisons at low risk of bias, the difference in logit(AUC) between LR and ML was 0.00 (95% confidence interval, −0.18 to 0.18). For 137 comparisons at high risk of bias, logit(AUC) was 0.34 (0.20–0.47) higher for ML.
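To give a sense of scale for the logit(AUC) differences reported above, the following minimal sketch converts a difference on the logit scale back to the AUC scale. The baseline LR AUC values are hypothetical and chosen only for illustration, and the function names are not from the study. It assumes the pooled difference can be applied as a simple shift to a single baseline, which is an approximation rather than the review's meta-analytic model.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a proportion strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Inverse of logit: map a log-odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))


def shifted_auc(baseline_auc: float, logit_diff: float) -> float:
    """AUC obtained by adding logit_diff to logit(baseline_auc)."""
    return inv_logit(logit(baseline_auc) + logit_diff)


if __name__ == "__main__":
    # Hypothetical baseline AUCs for LR (illustration only).
    for lr_auc in (0.60, 0.70, 0.80, 0.90):
        # 0.34 is the difference reported for comparisons at high risk of bias.
        ml_auc = shifted_auc(lr_auc, 0.34)
        print(f"LR AUC {lr_auc:.2f} -> ML AUC {ml_auc:.3f}")
```

Because the logit scale stretches values near 0.5 and 1, the same logit difference corresponds to a smaller absolute AUC gain when the baseline AUC is already high.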

Conclusion

We found no evidence of superior performance of ML over LR. Improvements in methodology and reporting are needed for studies that compare modeling algorithms.

Introduction

Clinical risk prediction models are ubiquitous in many medical domains. These models aim to predict a clinically relevant outcome using person-level information. The traditional approach to developing these models involves the use of regression models, for example, logistic regression (LR) to predict disease presence (diagnosis) or disease outcomes (prognosis) [1]. Machine learning (ML) algorithms are gaining in popularity as an alternative approach for prediction and classification problems. ML methods include artificial neural networks, support vector machines, and random forests [2]. Although ML methods have been sporadically used for clinical prediction for some time [3], [4], the growing availability of increasingly large and rich data sets, such as electronic health records data, has reignited interest in exploiting these methods [5], [6], [7].

Definitions of what constitutes ML and the differences with statistical modeling have been discussed at length in the literature [8], yet the distinction is not clear-cut [9]. The seminal reference on this issue is Breiman's review of the “two cultures” [8]. Breiman contrasts theory-based models such as regression with empirical algorithms such as decision trees, artificial neural networks, support vector machines, or random forests. A useful definition of ML is that it focuses on models that directly and automatically learn from data [10]. By contrast, regression models are based on theory and assumptions, and benefit from human intervention and subject knowledge for model specification. For example, ML performs modeling more automatically than regression regarding the inclusion of nonlinear associations and interaction terms [11]. To do so, ML algorithms are often highly flexible and require penalization to avoid overfitting [12]. Some researchers describe the distinction between statistical modeling and ML as a continuum [5]. Other researchers label any method that deviates from basic regression models as ML [13], such as penalized regression (e.g., LASSO, elastic net) or generalized additive models (GAM). We note that these methods do not belong to ML under the “automatic learning from data” definition, and we did not classify them as ML in this study.

Owing to its flexibility, ML is claimed to perform better than traditional statistical modeling, and to better handle a larger number of potential predictors [5], [6], [7], [12], [14], [15], [16]. However, recent research suggested that ML requires more data than LR, which contradicts the above claim [17]. Furthermore, ML models are typically assessed in terms of discrimination performance (e.g., accuracy, area under the receiver operating characteristic [ROC] curve [AUC]), while the reliability of risk predictions (calibration) is often not assessed [18]. The claim of improved performance in clinical prediction is therefore not established.
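The distinction between discrimination and calibration can be illustrated with a small sketch. The outcomes and predicted risks below are made up for illustration, and the helper names are hypothetical. The example assumes a simple "mean calibration" check (average predicted risk versus observed event rate), which is only the weakest form of calibration assessment and not the procedure used in any reviewed study.

```python
def auc(y_true, y_score):
    """Probability that a random event is ranked above a random non-event.

    Ties between an event and a non-event count as one half.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_calibration_gap(y_true, y_prob):
    """Mean predicted risk minus observed event rate (0 means no gap)."""
    return sum(y_prob) / len(y_prob) - sum(y_true) / len(y_true)


# Made-up outcomes and predicted risks (illustration only).
y = [0, 0, 0, 1, 0, 1, 1, 1]
risk = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

# A strictly increasing transform keeps the ranking but inflates every risk.
inflated = [p ** 0.25 for p in risk]

if __name__ == "__main__":
    print("AUC original:", auc(y, risk))
    print("AUC inflated:", auc(y, inflated))
    print("Gap original:", round(mean_calibration_gap(y, risk), 3))
    print("Gap inflated:", round(mean_calibration_gap(y, inflated), 3))
```

The two sets of predictions have identical AUC because AUC depends only on the ranking of patients, yet the inflated predictions systematically overestimate risk. A comparison based on AUC alone cannot detect this.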

The primary objective of this study was to compare the performance of LR with ML algorithms for the development of diagnostic or prognostic clinical prediction models for binary outcomes based on clinical data. Secondary objectives were to describe the characteristics of the studies, the type of ML algorithms that were used, the validation process, the modeling aspects of LR and ML, reporting quality, and risk of bias for comparing performance between regression and ML [19].

Section snippets

Materials and methods

The study was registered with PROSPERO (CRD42018068587). We followed the Preferred Reporting Items for Systematic reviews and Meta-Analysis (PRISMA) statement.

Results

Our search identified 927 articles published between 1/2016 and 8/2017, of which 802 studies were excluded based on title or abstract (Fig. 1). Fifty-four studies were excluded during full-text screening. Seventy-one studies met inclusion criteria and came from a wide variety of clinical domains, with oncology and cardiovascular medicine as the most common (Table A.3–4) [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46],

Discussion

Our systematic review of studies that compare clinical prediction models using LR and ML yielded the following key findings. Reporting of methodology and findings was very often incomplete and unclear, and model validation procedures were still often poor. Calibration of risk predictions was seldom examined, and AUC performance of LR and ML was on average no different when comparisons had low risk of bias. The latter finding is in line with the claim that traditional approaches often perform

Acknowledgments

This work was supported by the Research Foundation–Flanders (FWO) [grant G0B4716N]; Internal Funds KU Leuven [grant C24/15/037]; Cancer Research UK [grant 5529/A16895]; the NIHR Biomedical Research Centre, Oxford, UK. The funding sources had no role in the conception, design, data collection, analysis, or reporting of this study.

References (113)

I. Kononenko. Machine learning for medical diagnosis: history, state of the art and perspective. Artif Intell Med (2001)

P.J. Lisboa et al. The use of artificial neural networks in decision support in cancer: a systematic review. Neural Netw (2006)

B. Van Calster et al. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol (2016)

E.W. Steyerberg et al. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol (2001)

A.E. Anderson et al. Electronic health record phenotyping improves detection and screening of type 2 diabetes in the general United States population: a cross-sectional, unselected, retrospective study. J Biomed Inform (2016)

D. Ichikawa et al. How can machine-learning methods assist in virtual screening for hyperuricemia? A healthcare machine-learning approach. J Biomed Inform (2016)

A. Kabeshova et al. Falling in the elderly: do statistical models matter for performance criteria of fall prediction? Results from two large population-based studies. Eur J Intern Med (2016)

T. Belliveau et al. Developing artificial neural network models to predict functioning one year after traumatic spinal cord injury. Arch Phys Med Rehabil (2016)

H.H. Rau et al. Development of a web-based liver cancer prediction model for type II diabetes patients by using an artificial neural network. Comput Methods Programs Biomed (2016)

E.G. Ross et al. The use of machine learning for the identification of peripheral artery disease and future mortality risk. J Vasc Surg (2016)

The elements of statistical learning: data mining, inference, and prediction (2009)

A.L. Beam et al. Big data and machine learning in health care. JAMA (2018)

J.H. Chen et al. Machine learning and prediction in medicine — beyond the peak of inflated expectations. N Engl J Med (2017)

B.A. Goldstein et al. Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges. Eur Heart J (2017)

L. Breiman. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci (2001)

K.G.M. Moons et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist. PLoS Med (2014)

T.M. Mitchell. Machine learning (1997)

A.L. Boulesteix et al. Machine learning versus statistical modeling. Biom J (2014)

R.C. Deo et al. Learning about machine learning: the promise and pitfalls of big data and the electronic health record. Circ Cardiovasc Qual Outcomes (2016)

H. He et al. Learning from imbalanced data. IEEE Trans Knowl Data Eng (2008)

N.L.M.M. Pochet et al. Support vector machines versus logistic regression: improving prospective performance in clinical decision-making. Ultrasound Obstet Gynecol (2006)

A. Rajkomar et al. Scalable and accurate deep learning for electronic health records. NPJ Digit Med (2018)

W. Luo et al. Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view. J Med Internet Res (2016)

T. van der Ploeg et al. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol (2014)

G.S. Collins et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. J Clin Epidemiol (2015)

A.L. Boulesteix et al. A plea for neutral comparison studies in computational sciences. PLoS One (2013)

D.J. Hand. Classifier technology and the illusion of progress. Stat Sci (2006)

P.F. Whiting et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med (2011)

P. Probst et al. Tunability: importance of hyperparameters of machine learning algorithms (2018)

G.S. Collins et al. Quantifying the impact of different approaches for handling continuous predictors on the performance of a prognostic model. Stat Med (2016)

M.S. Pepe. The statistical evaluation of medical tests for classification and prediction (2003)

M. Adavi et al. Artificial neural networks versus bivariate logistic regression in prediction diagnosis of patients with hypertension and diabetes. Med J Islam Repub Iran (2016)

Z. Habibi et al. Predicting ventriculoperitoneal shunt infection in children with hydrocephalus using artificial neural network. Childs Nerv Syst (2016)

M. Jahani et al. Comparison of predictive models for the early diagnosis of diabetes. Healthc Inform Res (2016)

R.J. Kate et al. Prediction and detection models for acute kidney injury in hospitalized older adults. BMC Med Inform Decis Mak (2016)

