Accessible data curation and analytics for international-scale citizen science datasets
Abstract The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in Ma...
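Main authors: | Benjamin Murray, Eric Kerfoot, Liyuan Chen, Jie Deng, Mark S. Graham, Carole H. Sudre, Erika Molteni, Liane S. Canas, Michela Antonelli, Kerstin Klaser, Alessia Visconti, Alexander Hammers, Andrew T. Chan, Paul W. Franks, Richard Davies, Jonathan Wolf, Tim D. Spector, Claire J. Steves, Marc Modat, Sebastien Ourselin |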
---|---|
Format: | article |
Language: | EN |
Published: |
Nature Portfolio
2021
|
Subjects: | Science |
Online access: | https://doaj.org/article/4657a31169ad4964afaba25bc27b394f |
id |
oai:doaj.org-article:4657a31169ad4964afaba25bc27b394f |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:4657a31169ad4964afaba25bc27b394f; 2021-11-28T12:24:12Z; Accessible data curation and analytics for international-scale citizen science datasets; 10.1038/s41597-021-01071-x; 2052-4463; https://doaj.org/article/4657a31169ad4964afaba25bc27b394f; 2021-11-01T00:00:00Z; https://doi.org/10.1038/s41597-021-01071-x; https://doaj.org/toc/2052-4463; Abstract The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020. The success of the Covid Symptom Study creates significant technical challenges around effective data curation. The primary issue is scale. The size of the dataset means that it can no longer be readily processed using standard Python-based data analytics software such as Pandas on commodity hardware. Alternative technologies exist but carry a higher technical complexity and are less accessible to many researchers. We present ExeTera, a Python-based open source software package designed to provide Pandas-like data analytics on datasets that approach terabyte scales. We present its design and capabilities, and show how it is a critical component of a data curation pipeline that enables reproducible research across an international research group for the Covid Symptom Study.; Benjamin Murray; Eric Kerfoot; Liyuan Chen; Jie Deng; Mark S. Graham; Carole H. Sudre; Erika Molteni; Liane S. Canas; Michela Antonelli; Kerstin Klaser; Alessia Visconti; Alexander Hammers; Andrew T. Chan; Paul W. Franks; Richard Davies; Jonathan Wolf; Tim D. Spector; Claire J. Steves; Marc Modat; Sebastien Ourselin; Nature Portfolio; article; Science; Q; EN; Scientific Data, Vol 8, Iss 1, Pp 1-17 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
Science Q |
description |
Abstract The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020. The success of the Covid Symptom Study creates significant technical challenges around effective data curation. The primary issue is scale. The size of the dataset means that it can no longer be readily processed using standard Python-based data analytics software such as Pandas on commodity hardware. Alternative technologies exist but carry a higher technical complexity and are less accessible to many researchers. We present ExeTera, a Python-based open source software package designed to provide Pandas-like data analytics on datasets that approach terabyte scales. We present its design and capabilities, and show how it is a critical component of a data curation pipeline that enables reproducible research across an international research group for the Covid Symptom Study. |
format |
article |
author |
Benjamin Murray, Eric Kerfoot, Liyuan Chen, Jie Deng, Mark S. Graham, Carole H. Sudre, Erika Molteni, Liane S. Canas, Michela Antonelli, Kerstin Klaser, Alessia Visconti, Alexander Hammers, Andrew T. Chan, Paul W. Franks, Richard Davies, Jonathan Wolf, Tim D. Spector, Claire J. Steves, Marc Modat, Sebastien Ourselin |
author_sort |
Benjamin Murray |
title |
Accessible data curation and analytics for international-scale citizen science datasets |
publisher |
Nature Portfolio |
publishDate |
2021 |
url |
https://doaj.org/article/4657a31169ad4964afaba25bc27b394f |