A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples

Next-Generation Sequencing (NGS) is widely used to investigate genomic variation. In several studies, the genetic variation of Mycobacterium tuberculosis has been analyzed in sputum samples without previous culture, using target enrichment methodologies for NGS. Alignments obtained by different prog...

Bibliographic Details
Main authors: Betzaida Cuevas-Córdoba, Cristóbal Fresno, Joshua I. Haase-Hernández, Martín Barbosa-Amezcua, Minerva Mata-Rocha, Marcela Muñoz-Torrico, Miguel A. Salazar-Lezama, José A. Martínez-Orozco, Luis A. Narváez-Díaz, Jorge Salas-Hernández, Vanessa González-Covarrubias, Xavier Soberón
Format: article
Language: EN
Published: Public Library of Science (PLoS) 2021
Subjects:
Medicine (R)
Science (Q)
Online access: https://doaj.org/article/fa41656a32554f49b0483c010fedc0f6
id oai:doaj.org-article:fa41656a32554f49b0483c010fedc0f6
record_format dspace
spelling oai:doaj.org-article:fa41656a32554f49b0483c010fedc0f6
2021-11-04T06:19:43Z
A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples
1932-6203
https://doaj.org/article/fa41656a32554f49b0483c010fedc0f6
2021-01-01T00:00:00Z
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8547644/?tool=EBI
https://doaj.org/toc/1932-6203
Public Library of Science (PLoS)
article
PLoS ONE, Vol 16, Iss 10 (2021)
institution DOAJ
collection DOAJ
language EN
topic Medicine
R
Science
Q
description Next-Generation Sequencing (NGS) is widely used to investigate genomic variation. In several studies, the genetic variation of Mycobacterium tuberculosis has been analyzed in sputum samples without prior culture, using target enrichment methodologies for NGS. Alignment programs are generally run under default parameters, and the resulting alignments are assumed to contain only Mycobacterium reads. However, in clinical samples, variants of the microorganism of interest can be confused with a vast collection of reads from other bacteria, viruses, and human DNA. Currently, there are no standardized pipelines, and the cleaning success is never verified, since rigorous controls are lacking to identify and remove reads from other sputum microorganisms that are genetically similar to M. tuberculosis. Therefore, we designed a bioinformatic pipeline to process NGS data from sputum samples, including several filters and quality control points that identify and eliminate non-M. tuberculosis reads to obtain a reliable genetic variant report. Our proposal uses the SURPI software as a taxonomic classifier to filter input sequences and performs a mapping that provides the highest percentage of Mycobacterium reads while minimizing the reads from other microorganisms. We then use the filtered sequences to perform variant calling with the GATK software, checking mapping quality and applying realignment, recalibration, hard-filtering, and post-filtering to increase the reliability of the reported variants. Using default mapping parameters, we identified reads of contaminant bacteria, such as Streptococcus, Rothia, Actinomyces, and Veillonella. Our final mapping strategy, using an edit distance of three, gave a sequence identity of 97.8% between the input reads and the whole M. tuberculosis H37Rv reference genome, removing 98.8% of the off-target sequences at the cost of losing 1.7% of the Mycobacterium reads. Finally, more than 200 unreliable genetic variants were removed during variant calling, increasing the report’s reliability.
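The edit-distance criterion mentioned in the description can be illustrated with a small sketch. The abstract does not say how the authors applied the threshold, so this is only one plausible reading: it assumes the aligner writes the standard SAM `NM:i` tag (edit distance to the reference) and keeps mapped reads whose value is at most three. The function names `edit_distance`, `keep_read`, and `filter_sam` are hypothetical and are not taken from the published pipeline.

```python
from typing import Iterable, Iterator, Optional

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def edit_distance(sam_line: str) -> Optional[int]:
    """Return the NM:i value of a SAM alignment line, or None if the tag is absent."""
    # Optional tags start at the 12th tab-separated column.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("NM:i:"):
            return int(field[5:])
    return None


def keep_read(sam_line: str, max_edit_distance: int = 3) -> bool:
    """True for a mapped read whose edit distance is within the threshold."""
    flag = int(sam_line.split("\t")[1])
    if flag & UNMAPPED_FLAG:
        return False
    nm = edit_distance(sam_line)
    # Assumption: reads without an NM tag are dropped rather than trusted.
    return nm is not None and nm <= max_edit_distance


def filter_sam(lines: Iterable[str], max_edit_distance: int = 3) -> Iterator[str]:
    """Yield header lines unchanged and only the alignment lines that pass keep_read."""
    for line in lines:
        if line.startswith("@") or keep_read(line, max_edit_distance):
            yield line
```

For example, `filter_sam(open("sample.sam"))` would stream the retained records of a hypothetical `sample.sam`; for BAM input, a reader such as pysam or a `samtools view` step would be needed first. Comparing read counts before and after the filter gives the kind of off-target removal and Mycobacterium read-loss percentages that the description reports.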
format article
author Betzaida Cuevas-Córdoba
Cristóbal Fresno
Joshua I. Haase-Hernández
Martín Barbosa-Amezcua
Minerva Mata-Rocha
Marcela Muñoz-Torrico
Miguel A. Salazar-Lezama
José A. Martínez-Orozco
Luis A. Narváez-Díaz
Jorge Salas-Hernández
Vanessa González-Covarrubias
Xavier Soberón
title A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples
publisher Public Library of Science (PLoS)
publishDate 2021
url https://doaj.org/article/fa41656a32554f49b0483c010fedc0f6