Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences

The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking...

Bibliographic Details
Main authors:	Grace A. Blackwell, Martin Hunt, Kerri M. Malone, Leandro Lima, Gal Horesh, Blaise T. F. Alako, Nicholas R. Thomson, Zamin Iqbal
Format:	article
Language:	EN
Published:	Public Library of Science (PLoS) 2021
Subjects:	Biology (General) QH301-705.5
Online access:	https://doaj.org/article/c01063cf412e499ab673a82efdc0d7bf

id	oai:doaj.org-article:c01063cf412e499ab673a82efdc0d7bf
record_format	dspace
spelling	oai:doaj.org-article:c01063cf412e499ab673a82efdc0d7bf 2021-11-18T05:34:49Z Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences 1544-9173 1545-7885 https://doaj.org/article/c01063cf412e499ab673a82efdc0d7bf 2021-11-01T00:00:00Z https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8577725/?tool=EBI https://doaj.org/toc/1544-9173 https://doaj.org/toc/1545-7885 Grace A. Blackwell Martin Hunt Kerri M. Malone Leandro Lima Gal Horesh Blaise T. F. Alako Nicholas R. Thomson Zamin Iqbal Public Library of Science (PLoS) article Biology (General) QH301-705.5 EN PLoS Biology, Vol 19, Iss 11 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	Biology (General) QH301-705.5
description	The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies. 
This study presents the first uniformly assembled, comprehensively described and searchable dataset of 661,405 bacterial genomes; this resource will empower more scientists to harness the multitude of data in public sequencing archives, but also reveals the biased composition of these archives, with 90% of the data originating from just 20 species.
format	article
author	Grace A. Blackwell Martin Hunt Kerri M. Malone Leandro Lima Gal Horesh Blaise T. F. Alako Nicholas R. Thomson Zamin Iqbal
author_sort	Grace A. Blackwell
title	Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences
publisher	Public Library of Science (PLoS)
publishDate	2021
url	https://doaj.org/article/c01063cf412e499ab673a82efdc0d7bf