Scalable long read self-correction and assembly polishing with multiple sequence alignment

Abstract Third-generation sequencing technologies make it possible to sequence long reads of tens of kbp, which are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long-read analysis projects. We introduce CONSEN...

Bibliographic Details
Main authors: Pierre Morisse, Camille Marchet, Antoine Limasset, Thierry Lecroq, Arnaud Lefebvre
Format: article
Language: EN
Published: Nature Portfolio 2021
Subjects:
R
Q
Online access: https://doaj.org/article/fcd66fffea6c4c26a56a20d13c28f227
id oai:doaj.org-article:fcd66fffea6c4c26a56a20d13c28f227
record_format dspace
spelling oai:doaj.org-article:fcd66fffea6c4c26a56a20d13c28f227
2021-12-02T14:12:46Z
Scalable long read self-correction and assembly polishing with multiple sequence alignment
10.1038/s41598-020-80757-5
2045-2322
https://doaj.org/article/fcd66fffea6c4c26a56a20d13c28f227
2021-01-01T00:00:00Z
https://doi.org/10.1038/s41598-020-80757-5
https://doaj.org/toc/2045-2322
Abstract Third-generation sequencing technologies make it possible to sequence long reads of tens of kbp, which are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long-read analysis projects. We introduce CONSENT, a new self-correction method that relies both on multiple sequence alignment and local de Bruijn graphs. To ensure scalability, multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state of the art, and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and can process a full human dataset, containing reads reaching up to 1.5 Mbp, in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature, which can correct raw assemblies. Our experiments show that CONSENT is 2 to 38 times faster than other polishing tools, while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and polishing the assembly is less resource-intensive than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT.
Pierre Morisse
Camille Marchet
Antoine Limasset
Thierry Lecroq
Arnaud Lefebvre
Nature Portfolio
article
Medicine
R
Science
Q
EN
Scientific Reports, Vol 11, Iss 1, Pp 1-13 (2021)
institution DOAJ
collection DOAJ
language EN
topic Medicine
R
Science
Q
description Abstract Third-generation sequencing technologies make it possible to sequence long reads of tens of kbp, which are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long-read analysis projects. We introduce CONSENT, a new self-correction method that relies both on multiple sequence alignment and local de Bruijn graphs. To ensure scalability, multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state of the art, and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and can process a full human dataset, containing reads reaching up to 1.5 Mbp, in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature, which can correct raw assemblies. Our experiments show that CONSENT is 2 to 38 times faster than other polishing tools, while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and polishing the assembly is less resource-intensive than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT.
format article
author Pierre Morisse
Camille Marchet
Antoine Limasset
Thierry Lecroq
Arnaud Lefebvre
author_sort Pierre Morisse
title Scalable long read self-correction and assembly polishing with multiple sequence alignment
title_sort scalable long read self-correction and assembly polishing with multiple sequence alignment
publisher Nature Portfolio
publishDate 2021
url https://doaj.org/article/fcd66fffea6c4c26a56a20d13c28f227