Improving random forest predictions in small datasets from two-phase sampling designs

Abstract. Background: While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases, a common situation in biomedical studies, which often have rar...

Bibliographic Details
Main Authors:	Sunwoo Han, Brian D. Williamson, Youyi Fong
Format:	article
Language:	EN
Published:	BMC 2021
Subjects:	Case–control design; Variable screening; Class imbalance; HIV vaccine; Computer applications to medicine. Medical informatics; R858-859.7
Online Access:	https://doaj.org/article/59888be2e459495c93e907d674a72e1a

id	oai:doaj.org-article:59888be2e459495c93e907d674a72e1a
record_format	dspace
spelling	oai:doaj.org-article:59888be2e459495c93e907d674a72e1a | 2021-11-28T12:26:07Z | Improving random forest predictions in small datasets from two-phase sampling designs | 10.1186/s12911-021-01688-3 | 1472-6947 | https://doaj.org/article/59888be2e459495c93e907d674a72e1a | 2021-11-01T00:00:00Z | https://doi.org/10.1186/s12911-021-01688-3 | https://doaj.org/toc/1472-6947 | Sunwoo Han; Brian D. Williamson; Youyi Fong | BMC | article | Case–control design; Variable screening; Class imbalance; HIV vaccine; Computer applications to medicine. Medical informatics; R858-859.7 | EN | BMC Medical Informatics and Decision Making, Vol 21, Iss 1, Pp 1-9 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Case–control design; Variable screening; Class imbalance; HIV vaccine; Computer applications to medicine. Medical informatics; R858-859.7
spellingShingle	Case–control design; Variable screening; Class imbalance; HIV vaccine; Computer applications to medicine. Medical informatics; R858-859.7 | Sunwoo Han; Brian D. Williamson; Youyi Fong | Improving random forest predictions in small datasets from two-phase sampling designs
description	Abstract. Background: While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases, a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive. Methods: Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning. Results: Our experiments show that while class balancing helps improve random forest prediction performance when variable screening is not applied, class balancing has a negative impact on performance in the presence of variable screening. The impact of the weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests under-perform generalized linear models for some subsets of markers, and prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors, and that the extent of improvement depends critically on the dissimilarities between candidate learner predictions. Conclusion: In small datasets from two-phase sampling design, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests.
format	article
author	Sunwoo Han; Brian D. Williamson; Youyi Fong
author_facet	Sunwoo Han; Brian D. Williamson; Youyi Fong
author_sort	Sunwoo Han
title	Improving random forest predictions in small datasets from two-phase sampling designs
title_short	Improving random forest predictions in small datasets from two-phase sampling designs
title_full	Improving random forest predictions in small datasets from two-phase sampling designs
title_fullStr	Improving random forest predictions in small datasets from two-phase sampling designs
title_full_unstemmed	Improving random forest predictions in small datasets from two-phase sampling designs
title_sort	improving random forest predictions in small datasets from two-phase sampling designs
publisher	BMC
publishDate	2021
url	https://doaj.org/article/59888be2e459495c93e907d674a72e1a
work_keys_str_mv	AT sunwoohan improvingrandomforestpredictionsinsmalldatasetsfromtwophasesamplingdesigns AT briandwilliamson improvingrandomforestpredictionsinsmalldatasetsfromtwophasesamplingdesigns AT youyifong improvingrandomforestpredictionsinsmalldatasetsfromtwophasesamplingdesigns
_version_	1718407954128961536
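
The description field above names inverse sampling probability weighting as important in two-phase designs, but this record does not spell out how the weights are formed. The following is a minimal sketch of the general idea only, not the authors' implementation. It assumes that phase-two sampling probabilities are known per outcome stratum. The cohort size, case count, sampling fractions, and the function names `inverse_sampling_weights` and `weighted_mean` are hypothetical and chosen purely for illustration.

```python
def inverse_sampling_weights(outcomes, sampling_prob):
    """Weight each phase-two subject by 1 / P(selected into phase two | outcome)."""
    return [1.0 / sampling_prob[y] for y in outcomes]


def weighted_mean(values, weights):
    """Weighted average of values, normalised by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical phase-one cohort: 1000 subjects, 40 of them cases (outcome = 1).
cases = [1] * 40
controls = [0] * 960

# Hypothetical phase-two design: markers measured on every case and on
# every 10th control (a deterministic stand-in for a 10% random subsample).
phase_two = cases + controls[::10]
sampling_prob = {1: 1.0, 0: 0.1}

weights = inverse_sampling_weights(phase_two, sampling_prob)

# The unweighted phase-two sample over-represents cases; the weighted
# estimate recovers the case fraction of the full phase-one cohort.
naive_prevalence = sum(phase_two) / len(phase_two)
weighted_prevalence = weighted_mean(phase_two, weights)
print(f"unweighted: {naive_prevalence:.3f}, weighted: {weighted_prevalence:.3f}")
```

Weights of this kind could then be supplied as per-observation weights when fitting a learner; whether that helps appears, per the abstract, to depend on whether variable screening is also applied.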
