Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling).

Motivation: The size of today's biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data by downsampling is therefore a common early step in data processing,...

Bibliographic Details
Main authors:	Jörn Lötsch, Sebastian Malkusch, Alfred Ultsch
Format:	article
Language:	EN
Published:	Public Library of Science (PLoS) 2021
Subjects:	Medicine R Science Q
Online access:	https://doaj.org/article/71129c32c75c455e8a1369fa71ed80b0

id	oai:doaj.org-article:71129c32c75c455e8a1369fa71ed80b0
record_format	dspace
spelling	oai:doaj.org-article:71129c32c75c455e8a1369fa71ed80b0 2021-12-02T20:18:38Z Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling). 1932-6203 10.1371/journal.pone.0255838 https://doaj.org/article/71129c32c75c455e8a1369fa71ed80b0 2021-01-01T00:00:00Z https://doi.org/10.1371/journal.pone.0255838 https://doaj.org/toc/1932-6203 Jörn Lötsch Sebastian Malkusch Alfred Ultsch Public Library of Science (PLoS) article Medicine R Science Q EN PLoS ONE, Vol 16, Iss 8, p e0255838 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Medicine R Science Q
description	Motivation: The size of today's biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data by downsampling is therefore a common early step in data processing, often performed as random uniform class-proportional downsampling. In this report, we hypothesized that this can be optimized to obtain samples that better reflect the entire data set than those obtained using the current standard method. Results: By repeating the random sampling and comparing the distribution of the drawn sample with the distribution of the original data, it was possible to establish a method for obtaining subsets of data that better reflect the entire data set than taking only the first randomly selected subsample, as is the current standard. Experiments on artificial and real biomedical data sets showed that reconstruction of the remaining data of the original data set from the downsampled data improved significantly. This was observed with both principal component analysis and autoencoding neural networks. The fidelity depended on both the number of cases drawn from the original data and the number of samples drawn. Conclusions: Optimal distribution-preserving class-proportional downsampling yields data subsets that reflect the structure of the entire data better than those obtained with the standard method. By using distributional similarity as the only selection criterion, the proposed method does not in any way affect the results of a later planned analysis.
format	article
author	Jörn Lötsch Sebastian Malkusch Alfred Ultsch
author_facet	Jörn Lötsch Sebastian Malkusch Alfred Ultsch
author_sort	Jörn Lötsch
title	Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling).
title_sort	optimal distribution-preserving downsampling of large biomedical data sets (opdisdownsampling).
publisher	Public Library of Science (PLoS)
publishDate	2021
url	https://doaj.org/article/71129c32c75c455e8a1369fa71ed80b0
_version_	1718374290503499776
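
The procedure summarized in the description (repeat the class-proportional random draw, compare each drawn sample's distribution with that of the original data, keep the closest one) can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the opdisDownsampling package itself: the abstract does not name the similarity measure, so a two-sample Kolmogorov-Smirnov statistic, taken as the worst case over all variables, is used here as a stand-in, and every function and parameter name (`ks_statistic`, `class_proportional_draw`, `distribution_preserving_downsample`, `fraction`, `n_trials`) is hypothetical.

```python
import bisect
import random
from collections import defaultdict


def ks_statistic(sample, reference_sorted):
    """Largest gap between the empirical CDFs of a sample and a sorted reference.

    Assumption: the KS statistic stands in for the unspecified similarity measure.
    """
    sample_sorted = sorted(sample)
    n, m = len(sample_sorted), len(reference_sorted)
    gap = 0.0
    # Both CDFs are step functions, so the largest gap occurs at a data point.
    for x in set(sample_sorted) | set(reference_sorted):
        cdf_sample = bisect.bisect_right(sample_sorted, x) / n
        cdf_reference = bisect.bisect_right(reference_sorted, x) / m
        gap = max(gap, abs(cdf_sample - cdf_reference))
    return gap


def class_proportional_draw(labels, fraction, rng):
    """Uniform random draw of row indices that keeps each class's share."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for members in by_class.values():
        k = max(1, round(len(members) * fraction))
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


def distribution_preserving_downsample(rows, labels, fraction, n_trials=100, seed=0):
    """Repeat the draw n_trials times and keep the most similar sample.

    Returns the selected row indices and their score (lower is more similar).
    """
    rng = random.Random(seed)
    n_vars = len(rows[0])
    # Sort each variable of the full data once; reused for every trial.
    reference = [sorted(row[j] for row in rows) for j in range(n_vars)]
    best_idx, best_score = None, float("inf")
    for _ in range(n_trials):
        idx = class_proportional_draw(labels, fraction, rng)
        # Score a draw by its worst-matching variable.
        score = max(
            ks_statistic([rows[i][j] for i in idx], reference[j])
            for j in range(n_vars)
        )
        if score < best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score
```

With `n_trials=1` the sketch reduces to the standard single random class-proportional draw. The two quantities the description says fidelity depends on correspond here to `fraction` (how many cases are drawn) and `n_trials` (how many samples are drawn).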

Similar Items