Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results.

This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model and its gap from the test performance under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) a "simple" task, glioblastoma (n = 109) vs. brain metastasis (n = 58), and (2) a "difficult" task, low-grade (n = 163) vs. high-grade (n = 95) meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% of each of these datasets. We repeatedly performed random training-test set splitting for each dataset to create 1,000 different training-test set pairs. For each pair, a least absolute shrinkage and selection operator (LASSO) model was trained and evaluated with various validation methods in the training set, then tested in the test set, using the area under the curve (AUC) as the evaluation metric. The AUCs in training and testing varied across training-test set pairs, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between training and testing was 0.039 (±0.032) for the simple task without undersampling and 0.092 (±0.071) for the difficult task with undersampling. In one training-test set pair for the difficult task without undersampling, for example, the AUC was high in training but much lower in testing (0.882 and 0.667, respectively); in another pair for the same task, the AUC was low in training but much higher in testing (0.709 and 0.911, respectively). When the AUC discrepancy between training and testing, i.e., the generalization gap, was large, none of the validation methods sufficiently reduced it. Our results suggest that machine learning after a single random training-test set split may yield unreliable results in radiomics studies, especially those with small sample sizes.
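
The experiment described above lends itself to a compact simulation. The sketch below is a minimal illustration, not the authors' code: it repeats a random training-test split many times, fits an L1-penalized logistic regression (a LASSO-style classifier) on each training set, and compares the cross-validated training AUC with the held-out test AUC. The synthetic dataset, the use of Python/scikit-learn, and all parameter choices (test fraction, 5-fold cross-validation, regularization strength C) are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code): how repeated random
# training-test splits expose the variability of AUC estimates.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score

# Stand-in for a small radiomics dataset: 258 cases, 100 features
# (sizes echo the meningioma task; the data itself is synthetic).
X, y = make_classification(n_samples=258, n_features=100,
                           n_informative=10, weights=[0.63, 0.37],
                           random_state=0)

train_aucs, test_aucs, gaps = [], [], []
for seed in range(1000):  # 1,000 different training-test set pairs
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)

    # L1-penalized logistic regression as a LASSO-style classifier.
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

    # Cross-validated AUC inside the training set ...
    cv_auc = cross_val_score(model, X_tr, y_tr, cv=5,
                             scoring="roc_auc").mean()
    # ... versus AUC on the held-out test set.
    model.fit(X_tr, y_tr)
    test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

    train_aucs.append(cv_auc)
    test_aucs.append(test_auc)
    gaps.append(cv_auc - test_auc)

gaps = np.asarray(gaps)
print(f"training AUC: {np.mean(train_aucs):.3f} +/- {np.std(train_aucs):.3f}")
print(f"test AUC:     {np.mean(test_aucs):.3f} +/- {np.std(test_aucs):.3f}")
print(f"gap:          {gaps.mean():.3f} +/- {gaps.std():.3f} "
      f"(range {gaps.min():.3f} to {gaps.max():.3f})")
```

The printed range of gaps, rather than any single split's value, is the quantity the study draws attention to: with a few hundred cases, two equally "random" splits of the same dataset can yield very different test AUCs, so a single split gives a noisy estimate of generalization performance.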

Bibliographic Details
Main Authors: Chansik An, Yae Won Park, Sung Soo Ahn, Kyunghwa Han, Hwiyoung Kim, Seung-Koo Lee
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2021
Journal: PLoS ONE, Vol 16, Iss 8, p e0256152 (2021)
ISSN: 1932-6203
DOI: https://doi.org/10.1371/journal.pone.0256152
Subjects: Medicine (R); Science (Q)
Online Access: https://doaj.org/article/b7eeddeca55a457896778de5c6af09dc