Generating high-fidelity synthetic patient data for assessing machine learning healthcare software

Abstract There is a growing demand for the uptake of modern artificial intelligence technologies within healthcare systems. Many of these technologies exploit historical patient health data to build powerful predictive models that can be used to improve diagnosis and understanding of disease. Howeve...

Bibliographic Details
Main authors:	Allan Tucker, Zhenchen Wang, Ylenia Rotalinti, Puja Myles
Format:	article
Language:	EN
Published:	Nature Portfolio 2020
Subjects:	Computer applications to medicine. Medical informatics R858-859.7
Online access:	https://doaj.org/article/393faecb6a464df29677635072a0df65

id	oai:doaj.org-article:393faecb6a464df29677635072a0df65
record_format	dspace
spelling	oai:doaj.org-article:393faecb6a464df29677635072a0df65 2021-12-02T14:28:17Z Generating high-fidelity synthetic patient data for assessing machine learning healthcare software 10.1038/s41746-020-00353-9 2398-6352 https://doaj.org/article/393faecb6a464df29677635072a0df65 2020-11-01T00:00:00Z https://doi.org/10.1038/s41746-020-00353-9 https://doaj.org/toc/2398-6352 Abstract There is a growing demand for the uptake of modern artificial intelligence technologies within healthcare systems. Many of these technologies exploit historical patient health data to build powerful predictive models that can be used to improve diagnosis and understanding of disease. However, there are many issues concerning patient privacy that need to be accounted for in order to enable this data to be better harnessed by all sectors. One approach that could offer a method of circumventing privacy issues is the creation of realistic synthetic data sets that capture as many of the complexities of the original data set (distributions, non-linear relationships, and noise) but that does not actually include any real patient data. While previous research has explored models for generating synthetic data sets, here we explore the integration of resampling, probabilistic graphical modelling, latent variable identification, and outlier analysis for producing realistic synthetic data based on UK primary care patient data. In particular, we focus on handling missingness, complex interactions between variables, and the resulting sensitivity analysis statistics from machine learning classifiers, while quantifying the risks of patient re-identification from synthetic datapoints. We show that, through our approach of integrating outlier analysis with graphical modelling and resampling, we can achieve synthetic data sets that are not significantly different from original ground truth data in terms of feature distributions, feature dependencies, and sensitivity analysis statistics when inferring machine learning classifiers. What is more, the risk of generating synthetic data that is identical or very similar to real patients is shown to be low. Allan Tucker Zhenchen Wang Ylenia Rotalinti Puja Myles Nature Portfolio article Computer applications to medicine. Medical informatics R858-859.7 EN npj Digital Medicine, Vol 3, Iss 1, Pp 1-13 (2020)
institution	DOAJ
collection	DOAJ
language	EN
topic	Computer applications to medicine. Medical informatics R858-859.7
spellingShingle	Computer applications to medicine. Medical informatics R858-859.7 Allan Tucker Zhenchen Wang Ylenia Rotalinti Puja Myles Generating high-fidelity synthetic patient data for assessing machine learning healthcare software
description	Abstract There is a growing demand for the uptake of modern artificial intelligence technologies within healthcare systems. Many of these technologies exploit historical patient health data to build powerful predictive models that can be used to improve diagnosis and understanding of disease. However, there are many issues concerning patient privacy that need to be accounted for in order to enable this data to be better harnessed by all sectors. One approach that could offer a method of circumventing privacy issues is the creation of realistic synthetic data sets that capture as many of the complexities of the original data set (distributions, non-linear relationships, and noise) but that does not actually include any real patient data. While previous research has explored models for generating synthetic data sets, here we explore the integration of resampling, probabilistic graphical modelling, latent variable identification, and outlier analysis for producing realistic synthetic data based on UK primary care patient data. In particular, we focus on handling missingness, complex interactions between variables, and the resulting sensitivity analysis statistics from machine learning classifiers, while quantifying the risks of patient re-identification from synthetic datapoints. We show that, through our approach of integrating outlier analysis with graphical modelling and resampling, we can achieve synthetic data sets that are not significantly different from original ground truth data in terms of feature distributions, feature dependencies, and sensitivity analysis statistics when inferring machine learning classifiers. What is more, the risk of generating synthetic data that is identical or very similar to real patients is shown to be low.
format	article
author	Allan Tucker Zhenchen Wang Ylenia Rotalinti Puja Myles
author_facet	Allan Tucker Zhenchen Wang Ylenia Rotalinti Puja Myles
author_sort	Allan Tucker
title	Generating high-fidelity synthetic patient data for assessing machine learning healthcare software
title_short	Generating high-fidelity synthetic patient data for assessing machine learning healthcare software
title_full	Generating high-fidelity synthetic patient data for assessing machine learning healthcare software
title_fullStr	Generating high-fidelity synthetic patient data for assessing machine learning healthcare software
title_full_unstemmed	Generating high-fidelity synthetic patient data for assessing machine learning healthcare software
title_sort	generating high-fidelity synthetic patient data for assessing machine learning healthcare software
publisher	Nature Portfolio
publishDate	2020
url	https://doaj.org/article/393faecb6a464df29677635072a0df65
work_keys_str_mv	AT allantucker generatinghighfidelitysyntheticpatientdataforassessingmachinelearninghealthcaresoftware AT zhenchenwang generatinghighfidelitysyntheticpatientdataforassessingmachinelearninghealthcaresoftware AT yleniarotalinti generatinghighfidelitysyntheticpatientdataforassessingmachinelearninghealthcaresoftware AT pujamyles generatinghighfidelitysyntheticpatientdataforassessingmachinelearninghealthcaresoftware
_version_	1718391244523044864
