Evaluating High-Variance Leaves as Uncertainty Measure for Random Forest Regression

Uncertainty measures estimate the reliability of a predictive model. Especially in the field of molecular property prediction as part of drug design, model reliability is crucial. Besides other techniques, Random Forests have a long tradition in machine learning related to chemoinformatics and are w...

Bibliographic Details
Main authors: Thomas-Martin Dutschmann, Knut Baumann
Format: article
Language: EN
Published: MDPI AG 2021
Subjects:
Online access: https://doaj.org/article/d15d70fe033b45f9aeceb06f4d3b08ee
id oai:doaj.org-article:d15d70fe033b45f9aeceb06f4d3b08ee
record_format dspace
spelling oai:doaj.org-article:d15d70fe033b45f9aeceb06f4d3b08ee
2021-11-11T18:30:28Z
Evaluating High-Variance Leaves as Uncertainty Measure for Random Forest Regression
10.3390/molecules26216514
1420-3049
https://doaj.org/article/d15d70fe033b45f9aeceb06f4d3b08ee
2021-10-01T00:00:00Z
https://www.mdpi.com/1420-3049/26/21/6514
https://doaj.org/toc/1420-3049
Thomas-Martin Dutschmann
Knut Baumann
MDPI AG
article
chemoinformatics
machine learning
Random Forest
regression
ensemble
uncertainty measure
Organic chemistry
QD241-441
EN
Molecules, Vol 26, Iss 21, p 6514 (2021)
institution DOAJ
collection DOAJ
language EN
topic chemoinformatics
machine learning
Random Forest
regression
ensemble
uncertainty measure
Organic chemistry
QD241-441
description Uncertainty measures estimate the reliability of a predictive model. Especially in the field of molecular property prediction as part of drug design, model reliability is crucial. Besides other techniques, Random Forests have a long tradition in machine learning related to chemoinformatics and are widely used. Random Forests consist of an ensemble of individual regression models, namely, decision trees, and therefore provide an uncertainty measure already by construction. Regarding the disagreement of single-model predictions, a narrower distribution of predictions is interpreted as a higher reliability. The standard deviation of the decision tree ensemble predictions is the default uncertainty measure for Random Forests. Due to the increasing application of machine learning in drug design, there is a constant search for novel uncertainty measures that, ideally, outperform classical uncertainty criteria. When analyzing Random Forests, it appears obvious to consider the variance of the dependent variables within each terminal decision tree leaf to obtain predictive uncertainties. Hereby, predictions that arise from more leaves of high variance are considered less reliable. Expectedly, the number of such high-variance leaves yields a reasonable uncertainty measure. Depending on the dataset, it can also outperform ensemble uncertainties. However, small-scale comparisons, i.e., considering only a few datasets, are insufficient, since they are more prone to chance correlations. Therefore, large-scale estimations are required to make general claims about the performance of uncertainty measures. On several chemoinformatic regression datasets, high-variance leaves are compared to the standard deviation of ensemble predictions. It turns out that high-variance leaf uncertainty is meaningful but not superior to the default ensemble standard deviation. A brief possible explanation is offered.
format article
author Thomas-Martin Dutschmann
Knut Baumann
title Evaluating High-Variance Leaves as Uncertainty Measure for Random Forest Regression
publisher MDPI AG
publishDate 2021
url https://doaj.org/article/d15d70fe033b45f9aeceb06f4d3b08ee
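
The two uncertainty measures compared in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the toy leaf contents, the function names, the use of population variance, and the threshold of 1.0 are all assumptions made for illustration, since the record does not state how a "high-variance" leaf is defined. Each inner list holds the training targets of the leaf that one query compound reaches in one tree, and the tree's prediction is taken to be the mean of that leaf.

```python
from statistics import mean, pstdev, pvariance


def ensemble_std(tree_predictions):
    """Standard deviation of the per-tree predictions for one query."""
    return pstdev(tree_predictions)


def count_high_variance_leaves(leaves, threshold):
    """Count reached leaves whose target variance exceeds the threshold.

    The threshold is a hypothetical parameter chosen for this sketch.
    """
    return sum(1 for targets in leaves if pvariance(targets) > threshold)


# Hypothetical toy data: training targets stored in the leaf that the
# query reaches in each of four trees.
leaves = [
    [5.0, 5.2, 4.8],
    [5.1, 5.1],
    [3.0, 7.0, 5.0],
    [4.9, 5.3],
]

# Each tree predicts the mean of the leaf the query falls into.
tree_predictions = [mean(targets) for targets in leaves]

sd = ensemble_std(tree_predictions)
n_high = count_high_variance_leaves(leaves, threshold=1.0)

print(f"ensemble standard deviation: {sd:.3f}")
print(f"high-variance leaves: {n_high}")
```

In this toy case the tree predictions nearly agree, so the ensemble standard deviation is small, while one leaf still has widely spread targets. The two measures can therefore rank the same prediction differently, which is why the description calls for comparisons across many datasets.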