A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples
Next-Generation Sequencing (NGS) is widely used to investigate genomic variation. In several studies, the genetic variation of Mycobacterium tuberculosis has been analyzed in sputum samples without previous culture, using target enrichment methodologies for NGS. Alignments obtained by different prog...
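Main authors: | Betzaida Cuevas-Córdoba, Cristóbal Fresno, Joshua I. Haase-Hernández, Martín Barbosa-Amezcua, Minerva Mata-Rocha, Marcela Muñoz-Torrico, Miguel A. Salazar-Lezama, José A. Martínez-Orozco, Luis A. Narváez-Díaz, Jorge Salas-Hernández, Vanessa González-Covarrubias, Xavier Soberón |
---|---|
Format: | article |
Language: | EN |
Published: | Public Library of Science (PLoS), 2021 |
Subjects: | Medicine (R); Science (Q) |
Online access: | https://doaj.org/article/fa41656a32554f49b0483c010fedc0f6 |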
id |
oai:doaj.org-article:fa41656a32554f49b0483c010fedc0f6 |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:fa41656a32554f49b0483c010fedc0f6 2021-11-04T06:19:43Z A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples 1932-6203 https://doaj.org/article/fa41656a32554f49b0483c010fedc0f6 2021-01-01T00:00:00Z https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8547644/?tool=EBI https://doaj.org/toc/1932-6203 Next-Generation Sequencing (NGS) is widely used to investigate genomic variation. In several studies, the genetic variation of Mycobacterium tuberculosis has been analyzed in sputum samples without previous culture, using target enrichment methodologies for NGS. Alignment programs generally map the sequences under default parameters, and the results are assumed to contain only Mycobacterium reads. However, in clinical samples, variants of the microorganism of interest can be confused with a vast collection of reads from other bacteria, viruses, and human DNA. Currently, there are no standardized pipelines, and the cleaning success is never verified, because rigorous controls to identify and remove reads from other sputum microorganisms genetically similar to M. tuberculosis are lacking. Therefore, we designed a bioinformatic pipeline to process NGS data from sputum samples, including several filters and quality control points to identify and eliminate non-M. tuberculosis reads and obtain a reliable genetic variant report. Our proposal uses the SURPI software as a taxonomic classifier to filter input sequences and perform a mapping that provides the highest percentage of Mycobacterium reads, minimizing the reads from other microorganisms. We then use the filtered sequences to perform variant calling with the GATK software, applying mapping-quality checks, realignment, recalibration, hard-filtering, and a post-filter to increase the reliability of the reported variants. Using default mapping parameters, we identified reads of contaminant bacteria, such as Streptococcus, Rothia, Actinomyces, and Veillonella. Our final mapping strategy allowed a sequence identity of 97.8% between the input reads and the whole M. tuberculosis reference genome H37Rv using a genomic edit distance of three, thus removing 98.8% of the off-target sequences with a loss of 1.7% of Mycobacterium reads. Finally, more than 200 unreliable genetic variants were removed during the variant calling, increasing the report’s reliability. Betzaida Cuevas-Córdoba Cristóbal Fresno Joshua I. Haase-Hernández Martín Barbosa-Amezcua Minerva Mata-Rocha Marcela Muñoz-Torrico Miguel A. Salazar-Lezama José A. Martínez-Orozco Luis A. Narváez-Díaz Jorge Salas-Hernández Vanessa González-Covarrubias Xavier Soberón Public Library of Science (PLoS) article Medicine R Science Q EN PLoS ONE, Vol 16, Iss 10 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
Medicine R Science Q |
spellingShingle |
Medicine R Science Q Betzaida Cuevas-Córdoba Cristóbal Fresno Joshua I. Haase-Hernández Martín Barbosa-Amezcua Minerva Mata-Rocha Marcela Muñoz-Torrico Miguel A. Salazar-Lezama José A. Martínez-Orozco Luis A. Narváez-Díaz Jorge Salas-Hernández Vanessa González-Covarrubias Xavier Soberón A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples |
description |
Next-Generation Sequencing (NGS) is widely used to investigate genomic variation. In several studies, the genetic variation of Mycobacterium tuberculosis has been analyzed in sputum samples without previous culture, using target enrichment methodologies for NGS. Alignment programs generally map the sequences under default parameters, and the results are assumed to contain only Mycobacterium reads. However, in clinical samples, variants of the microorganism of interest can be confused with a vast collection of reads from other bacteria, viruses, and human DNA. Currently, there are no standardized pipelines, and the cleaning success is never verified, because rigorous controls to identify and remove reads from other sputum microorganisms genetically similar to M. tuberculosis are lacking. Therefore, we designed a bioinformatic pipeline to process NGS data from sputum samples, including several filters and quality control points to identify and eliminate non-M. tuberculosis reads and obtain a reliable genetic variant report. Our proposal uses the SURPI software as a taxonomic classifier to filter input sequences and perform a mapping that provides the highest percentage of Mycobacterium reads, minimizing the reads from other microorganisms. We then use the filtered sequences to perform variant calling with the GATK software, applying mapping-quality checks, realignment, recalibration, hard-filtering, and a post-filter to increase the reliability of the reported variants. Using default mapping parameters, we identified reads of contaminant bacteria, such as Streptococcus, Rothia, Actinomyces, and Veillonella. Our final mapping strategy allowed a sequence identity of 97.8% between the input reads and the whole M. tuberculosis reference genome H37Rv using a genomic edit distance of three, thus removing 98.8% of the off-target sequences with a loss of 1.7% of Mycobacterium reads. Finally, more than 200 unreliable genetic variants were removed during the variant calling, increasing the report’s reliability. |
format |
article |
author |
Betzaida Cuevas-Córdoba Cristóbal Fresno Joshua I. Haase-Hernández Martín Barbosa-Amezcua Minerva Mata-Rocha Marcela Muñoz-Torrico Miguel A. Salazar-Lezama José A. Martínez-Orozco Luis A. Narváez-Díaz Jorge Salas-Hernández Vanessa González-Covarrubias Xavier Soberón |
author_facet |
Betzaida Cuevas-Córdoba Cristóbal Fresno Joshua I. Haase-Hernández Martín Barbosa-Amezcua Minerva Mata-Rocha Marcela Muñoz-Torrico Miguel A. Salazar-Lezama José A. Martínez-Orozco Luis A. Narváez-Díaz Jorge Salas-Hernández Vanessa González-Covarrubias Xavier Soberón |
author_sort |
Betzaida Cuevas-Córdoba |
title |
A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples |
title_short |
A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples |
title_full |
A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples |
title_fullStr |
A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples |
title_full_unstemmed |
A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples |
title_sort |
bioinformatics pipeline for mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples |
publisher |
Public Library of Science (PLoS) |
publishDate |
2021 |
url |
https://doaj.org/article/fa41656a32554f49b0483c010fedc0f6 |
_version_ |
1718445128581906432 |
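
The description above reports that the final mapping strategy used an edit distance of three against the H37Rv reference. The record does not say how that threshold was applied, so the following is only a minimal sketch of one way such a filter could look, assuming the alignments are available as SAM text lines carrying the standard `NM:i` (edit distance) tag. The function names, the `MAX_EDIT_DISTANCE` constant, and the demo reads are hypothetical and are not taken from the authors' pipeline.

```python
from typing import Iterable, Iterator, Optional

# Threshold quoted in the abstract; how the authors applied it is an assumption here.
MAX_EDIT_DISTANCE = 3


def edit_distance(sam_line: str) -> Optional[int]:
    """Return the NM:i value of a SAM alignment line, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; optional tags follow them.
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[len("NM:i:"):])
    return None


def is_unmapped(sam_line: str) -> bool:
    """Check bit 0x4 of the FLAG column, which marks an unmapped read."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & 0x4)


def filter_by_edit_distance(
    lines: Iterable[str], max_nm: int = MAX_EDIT_DISTANCE
) -> Iterator[str]:
    """Yield header lines plus mapped alignments whose edit distance is <= max_nm."""
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        if is_unmapped(line):
            continue
        nm = edit_distance(line)
        # Reads without an NM tag are dropped, since their distance is unknown.
        if nm is not None and nm <= max_nm:
            yield line


if __name__ == "__main__":
    # Hypothetical demo records; "ref" stands in for the reference sequence name.
    demo = [
        "@SQ\tSN:ref\tLN:1000\n",
        "read1\t0\tref\t100\t60\t50M\t*\t0\t0\t*\t*\tNM:i:1\n",
        "read2\t0\tref\t200\t60\t50M\t*\t0\t0\t*\t*\tNM:i:7\n",
        "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\n",
    ]
    for kept in filter_by_edit_distance(demo):
        print(kept, end="")
```

A stricter threshold discards more off-target reads at the cost of losing genuine Mycobacterium reads that carry real variants, which is the trade-off the abstract quantifies as 98.8% off-target removal against a 1.7% read loss.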