Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics

Contaminating sequences in public genome databases are a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algori...

Bibliographic Details
Main authors:	Valérian Lupo, Mick Van Vlierberghe, Hervé Vanderschuren, Frédéric Kerff, Denis Baurain, Luc Cornet
Format:	article
Language:	EN
Published:	Frontiers Media S.A. 2021
Subjects:	sequencing, assembly, contamination, genomes, databases, NCBI RefSeq, Microbiology, QR1-502
Online access:	https://doaj.org/article/7e6ce7bcef7845fbaeea38bed2b9115c

id	oai:doaj.org-article:7e6ce7bcef7845fbaeea38bed2b9115c
record_format	dspace
spelling	oai:doaj.org-article:7e6ce7bcef7845fbaeea38bed2b9115c; 2021-11-05T16:03:17Z; Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics; 1664-302X; 10.3389/fmicb.2021.755101; https://doaj.org/article/7e6ce7bcef7845fbaeea38bed2b9115c; 2021-10-01T00:00:00Z; https://www.frontiersin.org/articles/10.3389/fmicb.2021.755101/full; https://doaj.org/toc/1664-302X; Valérian Lupo, Mick Van Vlierberghe, Hervé Vanderschuren, Frédéric Kerff, Denis Baurain, Luc Cornet; Frontiers Media S.A.; article; sequencing, assembly, contamination, genomes, databases, NCBI RefSeq, Microbiology, QR1-502; EN; Frontiers in Microbiology, Vol 12 (2021)
institution	DOAJ
collection	DOAJ
language	EN
topic	sequencing, assembly, contamination, genomes, databases, NCBI RefSeq, Microbiology, QR1-502
description	Contaminating sequences in public genome databases are a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which represents a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach with a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.
format	article
author	Valérian Lupo, Mick Van Vlierberghe, Hervé Vanderschuren, Frédéric Kerff, Denis Baurain, Luc Cornet
title	Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics
publisher	Frontiers Media S.A.
publishDate	2021
url	https://doaj.org/article/7e6ce7bcef7845fbaeea38bed2b9115c