Accessible data curation and analytics for international-scale citizen science datasets

Abstract The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in Ma...


Bibliographic Details
Main Authors: Benjamin Murray, Eric Kerfoot, Liyuan Chen, Jie Deng, Mark S. Graham, Carole H. Sudre, Erika Molteni, Liane S. Canas, Michela Antonelli, Kerstin Klaser, Alessia Visconti, Alexander Hammers, Andrew T. Chan, Paul W. Franks, Richard Davies, Jonathan Wolf, Tim D. Spector, Claire J. Steves, Marc Modat, Sebastien Ourselin
Format: article
Language: EN
Published: Nature Portfolio 2021
Subjects:
Q
Online Access: https://doaj.org/article/4657a31169ad4964afaba25bc27b394f
id oai:doaj.org-article:4657a31169ad4964afaba25bc27b394f
record_format dspace
spelling oai:doaj.org-article:4657a31169ad4964afaba25bc27b394f
2021-11-28T12:24:12Z
Accessible data curation and analytics for international-scale citizen science datasets
10.1038/s41597-021-01071-x
2052-4463
https://doaj.org/article/4657a31169ad4964afaba25bc27b394f
2021-11-01T00:00:00Z
https://doi.org/10.1038/s41597-021-01071-x
https://doaj.org/toc/2052-4463
Nature Portfolio
article
Science
Q
EN
Scientific Data, Vol 8, Iss 1, Pp 1-17 (2021)
institution DOAJ
collection DOAJ
language EN
topic Science
Q
description Abstract The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020. The success of the Covid Symptom Study creates significant technical challenges around effective data curation. The primary issue is scale. The size of the dataset means that it can no longer be readily processed using standard Python-based data analytics software such as Pandas on commodity hardware. Alternative technologies exist but carry a higher technical complexity and are less accessible to many researchers. We present ExeTera, a Python-based open source software package designed to provide Pandas-like data analytics on datasets that approach terabyte scales. We present its design and capabilities, and show how it is a critical component of a data curation pipeline that enables reproducible research across an international research group for the Covid Symptom Study.
format article
author Benjamin Murray
Eric Kerfoot
Liyuan Chen
Jie Deng
Mark S. Graham
Carole H. Sudre
Erika Molteni
Liane S. Canas
Michela Antonelli
Kerstin Klaser
Alessia Visconti
Alexander Hammers
Andrew T. Chan
Paul W. Franks
Richard Davies
Jonathan Wolf
Tim D. Spector
Claire J. Steves
Marc Modat
Sebastien Ourselin
title Accessible data curation and analytics for international-scale citizen science datasets
publisher Nature Portfolio
publishDate 2021
url https://doaj.org/article/4657a31169ad4964afaba25bc27b394f