RESCRIPt: Reproducible sequence taxonomy reference database management.

Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleoti...

Bibliographic Details
Main authors: Michael S Robeson, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich
Format: article
Language: EN
Published: Public Library of Science (PLoS) 2021
Subjects: Biology (General)
Online access: https://doaj.org/article/3b3ee68d2d0a49df902bccb26c1813c2
id oai:doaj.org-article:3b3ee68d2d0a49df902bccb26c1813c2
record_format dspace
spelling oai:doaj.org-article:3b3ee68d2d0a49df902bccb26c1813c2
2021-12-02T19:57:58Z
RESCRIPt: Reproducible sequence taxonomy reference database management.
1553-734X
1553-7358
10.1371/journal.pcbi.1009581
https://doaj.org/article/3b3ee68d2d0a49df902bccb26c1813c2
2021-11-01T00:00:00Z
https://doi.org/10.1371/journal.pcbi.1009581
https://doaj.org/toc/1553-734X
https://doaj.org/toc/1553-7358
Michael S Robeson
Devon R O'Rourke
Benjamin D Kaehler
Michal Ziemski
Matthew R Dillon
Jeffrey T Foster
Nicholas A Bokulich
Public Library of Science (PLoS)
article
Biology (General)
QH301-705.5
EN
PLoS Computational Biology, Vol 17, Iss 11, p e1009581 (2021)
institution DOAJ
collection DOAJ
language EN
topic Biology (General)
QH301-705.5
description Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.
format article
author Michael S Robeson
Devon R O'Rourke
Benjamin D Kaehler
Michal Ziemski
Matthew R Dillon
Jeffrey T Foster
Nicholas A Bokulich
author_sort Michael S Robeson
title RESCRIPt: Reproducible sequence taxonomy reference database management.
title_sort rescript: reproducible sequence taxonomy reference database management.
publisher Public Library of Science (PLoS)
publishDate 2021
url https://doaj.org/article/3b3ee68d2d0a49df902bccb26c1813c2
work_keys_str_mv AT michaelsrobeson rescriptreproduciblesequencetaxonomyreferencedatabasemanagement
AT devonrorourke rescriptreproduciblesequencetaxonomyreferencedatabasemanagement
AT benjamindkaehler rescriptreproduciblesequencetaxonomyreferencedatabasemanagement
AT michalziemski rescriptreproduciblesequencetaxonomyreferencedatabasemanagement
AT matthewrdillon rescriptreproduciblesequencetaxonomyreferencedatabasemanagement
AT jeffreytfoster rescriptreproduciblesequencetaxonomyreferencedatabasemanagement
AT nicholasabokulich rescriptreproduciblesequencetaxonomyreferencedatabasemanagement
_version_ 1718375771212349440