A data driven learning approach for the assessment of data quality

Abstract Background Data quality assessment is important but complex and task dependent. Identifying suitable measurement methods and reference ranges for assessing their results is challenging. Manually inspecting the measurement results and current data driven approaches for learning which results...


Bibliographic Details
Main Authors:	Erik Tute, Nagarajan Ganapathy, Antje Wulff
Format:	article
Language:	EN
Published:	BMC 2021
Subjects:	Information science Data quality Data aggregation Knowledge bases Machine learning Computer applications to medicine. Medical informatics R858-859.7
Online Access:	https://doaj.org/article/65c16f3ca3e645168cd282dd4ae49c6e

id	oai:doaj.org-article:65c16f3ca3e645168cd282dd4ae49c6e
record_format	dspace
spelling	oai:doaj.org-article:65c16f3ca3e645168cd282dd4ae49c6e | 2021-11-08T10:59:26Z | A data driven learning approach for the assessment of data quality | 10.1186/s12911-021-01656-x | 1472-6947 | https://doaj.org/article/65c16f3ca3e645168cd282dd4ae49c6e | 2021-11-01T00:00:00Z | https://doi.org/10.1186/s12911-021-01656-x | https://doaj.org/toc/1472-6947 | Erik Tute, Nagarajan Ganapathy, Antje Wulff | BMC | article | Information science, Data quality, Data aggregation, Knowledge bases, Machine learning, Computer applications to medicine. Medical informatics, R858-859.7 | EN | BMC Medical Informatics and Decision Making, Vol 21, Iss 1, Pp 1-11 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Information science Data quality Data aggregation Knowledge bases Machine learning Computer applications to medicine. Medical informatics R858-859.7
description	Abstract. Background: Data quality assessment is important but complex and task dependent. Identifying suitable measurement methods and reference ranges for assessing their results is challenging. Both manual inspection of measurement results and current data driven approaches for learning which results indicate data quality issues have considerable limitations, e.g. in identifying task dependent thresholds for measurement results that indicate data quality issues. Objectives: To explore the applicability and potential benefits of a data driven approach to learn task dependent knowledge about suitable measurement methods and the assessment of their results. Such knowledge could be useful for others to determine whether a local data stock is suitable for a given task. Methods: We started by creating artificial data with previously defined data quality issues and applied a set of generic measurement methods to this data (e.g. a method that counts the number of values in a certain variable or computes the mean of the values). We trained decision trees on the exported measurement methods’ results and corresponding outcome data (data that indicated the data’s suitability for a use case). For evaluation, we derived rules for potential measurement methods and reference values from the decision trees and compared these regarding their coverage of the true data quality issues artificially created in the dataset. Three researchers independently derived these rules: one with knowledge of the data quality issues present and two without. Results: Our self-trained decision trees were able to indicate rules for 12 of 19 previously defined data quality issues. Learned knowledge about measurement methods and their assessment was complementary to manual interpretation of measurement methods’ results. Conclusions: Our data driven approach derives sensible knowledge for task dependent data quality assessment and complements other current approaches. Based on labeled measurement methods’ results as training data, our approach successfully suggested applicable rules for checking data quality characteristics that determine whether a dataset is suitable for a given task.
format	article
author	Erik Tute Nagarajan Ganapathy Antje Wulff
title	A data driven learning approach for the assessment of data quality
publisher	BMC
publishDate	2021
url	https://doaj.org/article/65c16f3ca3e645168cd282dd4ae49c6e
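
The workflow outlined in the abstract (generic measurement methods applied to artificial data, then a tree learner trained on the labeled results to suggest a rule with a reference value) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: a one-level decision tree (a decision stump chosen by weighted Gini impurity) stands in for the decision trees named in the abstract, and the function names, the toy datasets, and the "suitable" labels are all hypothetical.

```python
from statistics import mean


def make_dataset(n_values, n_missing, value):
    """Hypothetical artificial variable: n_values entries, n_missing of them None."""
    return [value] * (n_values - n_missing) + [None] * n_missing


def measure(values):
    """Apply generic measurement methods to one variable (count, mean, missing rate)."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": mean(present) if present else 0.0,
        "missing_rate": 1 - len(present) / len(values),
    }


def gini(labels):
    """Gini impurity of a list of boolean labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Return (feature, threshold, impurity) of the best single split."""
    best = None
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between neighbouring observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Assumed toy data: (dataset, outcome label "suitable for the task").
labeled = [
    (make_dataset(10, 0, 5.0), True),
    (make_dataset(4, 0, 7.0), True),
    (make_dataset(10, 5, 6.0), False),
    (make_dataset(20, 8, 5.0), False),
    (make_dataset(8, 1, 6.0), True),
    (make_dataset(6, 3, 7.0), False),
]
rows = [measure(values) for values, _ in labeled]
labels = [suitable for _, suitable in labeled]

feature, threshold, impurity = best_split(rows, labels)
print(f"candidate rule: check '{feature}' against reference value {threshold:.4f}")
```

In this toy setup only the missing-rate measurement separates suitable from unsuitable datasets, so the stump proposes it as the measurement method together with a threshold; a full decision tree would repeat the same split search recursively on each branch.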


Similar Items