Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models

Abstract In view of the growth of clinical risk prediction models using genetic data, there is an increasing need for studies that use appropriate methods to select the optimum number of features from a large number of genetic variants with a high degree of redundancy between features due to linkage...

Bibliographic Details
Main authors:	Farideh Jalali-najafabadi, Michael Stadler, Nick Dand, Deepak Jadon, Mehreen Soomro, Pauline Ho, Helen Marzo-Ortega, Philip Helliwell, Eleanor Korendowych, Michael A. Simpson, Jonathan Packham, Catherine H. Smith, Jonathan N. Barker, Neil McHugh, Richard B. Warren, Anne Barton, John Bowes, BADBIR Study Group, BSTOP Study Group
Format:	article
Language:	EN
Published:	Nature Portfolio 2021
Subjects:	Medicine (R), Science (Q)
Online access:	https://doaj.org/article/ef9896df34ad40f89822e14d6ff1f794

id	oai:doaj.org-article:ef9896df34ad40f89822e14d6ff1f794
record_format	dspace
institution	DOAJ
collection	DOAJ
language	EN
topic	Medicine (R), Science (Q)
description	Abstract: As clinical risk prediction models increasingly use genetic data, there is a growing need for studies that use appropriate methods to select the optimum number of features from a large number of genetic variants with a high degree of redundancy between features due to linkage disequilibrium (LD). Filter feature selection methods based on information theoretic criteria are well suited to this challenge and identify a subset of the original variables that should result in more accurate prediction. However, data collected from cohort studies are often high-dimensional genetic data with potential confounders, which present challenges to feature selection and to machine learning risk prediction models. Patients with psoriasis are at high risk of developing a chronic arthritis known as psoriatic arthritis (PsA). The prevalence of PsA in this patient group can be up to 30%, and identifying high-risk patients is an important clinical research goal that would allow early intervention and a reduction of disability. This also provides an ideal scenario for the development of clinical risk prediction models and an opportunity to explore the application of information theoretic criteria methods. In this study, we developed feature selection and PsA risk prediction models and applied them to a cross-sectional genetic dataset of 1462 PsA cases and 1132 cutaneous-only psoriasis (PsC) cases, using 2-digit HLA alleles imputed with the SNP2HLA algorithm. We also developed a stratification method to mitigate the impact of potential confounding features and show that confounding features affect feature selection. From the mitigated dataset, a random 80% of the data was used to train seven supervised machine learning methods with stratified nested cross-validation, and 20% was selected randomly as a holdout set for internal validation. The risk prediction models were then further validated in a UK Biobank dataset containing data on 1187 participants and a set of features overlapping with the training dataset. Performance was evaluated using the area under the curve (AUC), accuracy, precision, recall, F1 score and decision curve analysis (net benefit). The best model was selected on three criteria: the smallest feature subset, the maximal average AUC over the nested cross-validation, and good generalisability to the UK Biobank dataset. In the original dataset, with over 100 different bootstraps and seven feature selection (FS) methods, HLA_C_06 was selected as the most informative genetic variant. After mitigation, the single most important genetic feature by rank was identified as HLA_B_27 by the seven different feature selection methods, consistent with previous analyses of these data using regression-based methods. However, the predictive accuracy of single features after mitigation was found to be moderate (AUC = 0.54 in internal cross-validation, 0.53 on the internal holdout set and 0.55 on the external dataset). Sequentially adding further HLA features by rank improved the performance of the Random Forest classification model: with 20 2-digit features selected by Interaction Capping (ICAP), it achieved AUC = 0.61 (internal cross-validation), 0.57 (internal holdout set) and 0.58 (external dataset). The stratification method for mitigating confounding features and the information theoretic filter feature selection can be applied to high-dimensional datasets with potential confounders.
format	article
author	Farideh Jalali-najafabadi, Michael Stadler, Nick Dand, Deepak Jadon, Mehreen Soomro, Pauline Ho, Helen Marzo-Ortega, Philip Helliwell, Eleanor Korendowych, Michael A. Simpson, Jonathan Packham, Catherine H. Smith, Jonathan N. Barker, Neil McHugh, Richard B. Warren, Anne Barton, John Bowes, BADBIR Study Group, BSTOP Study Group
author_sort	Farideh Jalali-najafabadi
title	Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models
publisher	Nature Portfolio
publishDate	2021
url	https://doaj.org/article/ef9896df34ad40f89822e14d6ff1f794
_version_	1718372169555116032
spelling	oai:doaj.org-article:ef9896df34ad40f89822e14d6ff1f794; 2021-12-05T12:11:43Z; DOI 10.1038/s41598-021-00854-x; ISSN 2045-2322; https://doaj.org/article/ef9896df34ad40f89822e14d6ff1f794; 2021-12-01T00:00:00Z; https://doi.org/10.1038/s41598-021-00854-x; https://doaj.org/toc/2045-2322; Scientific Reports, Vol 11, Iss 1, Pp 1-14 (2021)
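
The abstract in this record refers to ranking features with information theoretic filter criteria and to evaluating models by decision curve analysis (net benefit). The sketch below illustrates both ideas on toy data and is not the authors' pipeline. It uses the simplest filter criterion, ranking each feature by its empirical mutual information with the outcome. Unlike ICAP, this criterion ignores redundancy and interaction between features. The feature names (allele_a, allele_b, allele_c), the toy values and the function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Empirical mutual information I(X; Y) in bits for two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (px[x] * py[y])
        mi += p_xy * math.log2(count * n / (px[x] * py[y]))
    return mi


def rank_by_mutual_information(features, labels):
    """Rank feature names from most to least informative about the labels.

    This is plain mutual information ranking; it does not penalise
    redundancy between features, which criteria such as ICAP do.
    """
    scores = {name: mutual_information(col, labels) for name, col in features.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


def net_benefit(y_true, y_score, threshold):
    """Net benefit at one risk threshold pt: TP/n - FP/n * pt / (1 - pt)."""
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Hypothetical toy data: 1 = case, 0 = control; features are allele presence flags.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "allele_a": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly aligned with the labels
    "allele_b": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the labels
    "allele_c": [1, 1, 1, 0, 0, 0, 0, 1],  # partially aligned with the labels
}

ranking, scores = rank_by_mutual_information(features, labels)
print(ranking)
print(net_benefit([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2], threshold=0.5))
```

In a workflow like the one the abstract describes, a ranking of this kind would be computed inside each training fold of the nested cross-validation, so that the holdout and external data play no part in feature selection.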
