Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study

Abstract We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. In this study, mortality models were built using gradient boosting decisi...

Bibliographic Details
Main authors:	Iqbal Madakkatel, Ang Zhou, Mark D. McDonnell, Elina Hyppönen
Format:	article
Language:	EN
Published:	Nature Portfolio 2021
Subjects:	Medicine R Science Q
Online access:	https://doaj.org/article/8d07dd6dd43b4072bad55c2c9fa43b2b

id	oai:doaj.org-article:8d07dd6dd43b4072bad55c2c9fa43b2b
record_format	dspace
spelling	oai:doaj.org-article:8d07dd6dd43b4072bad55c2c9fa43b2b 2021-11-28T12:16:09Z Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study 10.1038/s41598-021-02476-9 2045-2322 https://doaj.org/article/8d07dd6dd43b4072bad55c2c9fa43b2b 2021-11-01T00:00:00Z https://doi.org/10.1038/s41598-021-02476-9 https://doaj.org/toc/2045-2322 Iqbal Madakkatel Ang Zhou Mark D. McDonnell Elina Hyppönen Nature Portfolio article Medicine R Science Q EN Scientific Reports, Vol 11, Iss 1, Pp 1-11 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Medicine R Science Q
spellingShingle	Medicine R Science Q Iqbal Madakkatel Ang Zhou Mark D. McDonnell Elina Hyppönen Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study
description	Abstract We present a simple and efficient hypothesis-free machine learning pipeline for risk factor discovery that accounts for non-linearity and interaction in large biomedical databases with minimal variable pre-processing. In this study, mortality models were built using gradient boosting decision trees (GBDT), and important predictors were identified using SHAP values, a Shapley values-based feature attribution method. Cox models controlled for false discovery rate were used for confounder adjustment, interpretability, and further validation. The pipeline was tested using information from 502,506 UK Biobank participants, aged 37–73 years at recruitment and followed over seven years for mortality registrations. From the 11,639 predictors included in GBDT, 193 potential risk factors had SHAP values ≥ 0.05, passed the correlation test, and were selected for further modelling. Of the summed variable importance, 60% was directly health-related, and baseline characteristics, sociodemographics, and lifestyle factors each contributed about 10%. Cox models adjusted for baseline characteristics showed evidence for an association with mortality for 166 of the 193 predictors. These included mostly well-known risk factors (e.g., age, sex, ethnicity, education, material deprivation, smoking, physical activity, self-rated health, BMI, and many disease outcomes). For 19 predictors we saw evidence for an association in the unadjusted but not the adjusted analyses, suggesting bias by confounding. Our GBDT-SHAP pipeline was able to identify relevant predictors ‘hidden’ within thousands of variables, providing an efficient and pragmatic solution for the first stage of hypothesis-free risk factor identification.
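The Shapley-value idea behind the SHAP attribution step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the article applies SHAP to a GBDT model, whereas the code below computes exact Shapley values by brute force for a tiny hand-written scoring function. The function `toy_risk`, the feature names, and the weights are all hypothetical and chosen only for illustration. The 0.05 cutoff mirrors the threshold quoted in the abstract, but how the authors aggregated SHAP values per predictor is not stated in this record.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: average each feature's marginal
    contribution over every possible ordering of the features."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            after = value_fn(frozenset(present))
            totals[f] += after - before
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_risk(present):
    """Hypothetical risk score (assumption, not from the article):
    two main effects, one interaction, and one irrelevant feature."""
    score = 0.0
    if "age" in present:
        score += 0.30
    if "smoking" in present:
        score += 0.10
    if "age" in present and "smoking" in present:
        score += 0.06  # interaction, shared equally by both features
    return score  # "noise" never changes the score


FEATURES = ["age", "smoking", "noise"]
phi = shapley_values(toy_risk, FEATURES)

# Keep features whose absolute attribution reaches the cutoff
# (0.05 is taken from the abstract; its exact use there is assumed).
THRESHOLD = 0.05
selected = [f for f in FEATURES if abs(phi[f]) >= THRESHOLD]

for name in FEATURES:
    print(f"{name}: {phi[name]:.3f}")
print("selected:", selected)
```

Brute-force enumeration is only feasible for a handful of features, since the number of orderings grows factorially. With thousands of predictors, as in the study described here, a tree-specific approximation would be needed instead.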
format	article
author	Iqbal Madakkatel Ang Zhou Mark D. McDonnell Elina Hyppönen
author_facet	Iqbal Madakkatel Ang Zhou Mark D. McDonnell Elina Hyppönen
author_sort	Iqbal Madakkatel
title	Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study
title_sort	combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study
publisher	Nature Portfolio
publishDate	2021
url	https://doaj.org/article/8d07dd6dd43b4072bad55c2c9fa43b2b
